Abstract

Existing Bayesian models, especially nonparametric Bayesian methods, rely heavily on 
specially conceived priors to incorporate domain knowledge for discovering improved latent 
representations. While priors can affect posterior distributions through Bayes' theorem, 
imposing posterior regularization is arguably more direct and in some cases can be more 
natural and easier. In this paper, we present regularized Bayesian inference (RegBayes), a 
computational framework to perform posterior inference with a convex regularization on the 
desired post-data posterior distributions. RegBayes covers both directed Bayesian networks 
and undirected Markov networks whose Bayesian formulation results in hybrid chain graph 
models. When the convex regularization is induced from a linear operator on the posterior 
distributions, RegBayes can be solved with convex analysis theory. Furthermore, we present 
two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and 
multi-task infinite latent support vector machines (MT-iLSVM), which explore the large- 
margin idea in combination with a nonparametric Bayesian model for discovering predictive 
latent features for classification and multi-task learning, respectively. We present efficient 
inference methods and report empirical studies on several benchmark datasets, which ap- 
pear to demonstrate the merits inherited from both large-margin learning and Bayesian 
nonparametrics. Such results were not available until now, and they help push forward
the interface between these two important subfields, which have been largely treated as
isolated in the community.

Keywords: Bayesian inference, regularization, Bayesian nonparametrics, large-margin 
learning, classification, multi-task learning 



1. Introduction 



Bayesian inference, one of the elegant statistical estimation frameworks, is becoming
increasingly popular not only in building artificial systems (Pearl, 1988; Bishop, 2006) that
handle uncertainties but also in the efforts to develop a theory of how the brain works
(Ernst and Banks, 2002; Knill and Pouget, 2004; Tenenbaum et al., 2011). At the core of the
Bayesian way of thinking is Bayes' theorem (aka Bayes' rule), which offers a mathematically
rigorous computational mechanism to "reverse engineer" a physical generative process to find
the distribution of the hidden structures that likely generated the observed data. Recently,
nonparametric Bayesian models have gained remarkable popularity, partly owing to their
desirable "nonparametric" nature, which allows practitioners to sidestep the difficult model
selection problem, e.g., figuring out the unknown number of components (or classes) in a
mixture model (Antoniak, 1974) or determining the unknown dimensionality of latent features
(Griffiths and Ghahramani, 2005), by using an appropriate prior distribution with a large
support. Furthermore, nonparametric Bayesian models allow the model complexity to grow as
more data are observed, which is also the key factor that makes nonparametric Bayesian
models different from other standard Bayesian models. Among the most commonly used priors
are the Gaussian process (GP) (Rasmussen and Ghahramani, 2002), the Dirichlet process (DP)
(Ferguson, 1973; Antoniak, 1974) and the Indian buffet process (IBP) (Griffiths and
Ghahramani, 2005).

However, standard nonparametric Bayesian models usually make strict and unrealistic
assumptions on data, such as observations being homogeneous or exchangeable. A number of
recent developments in Bayesian nonparametrics have attempted to relax such assumptions.
For example, to handle heterogeneous observations, predictor-dependent processes
(MacEachern, 1999; Williamson et al., 2010) have been proposed; and to relax the
exchangeability assumption, various correlation structures, such as hierarchical structures
(Teh et al., 2006), temporal or spatial dependencies (Beal et al., 2002; ...), and stochastic
ordering dependencies (..., 2007), have been introduced. Although this progress has been
substantial, developing sufficiently flexible nonparametric priors still has a long way to go
to meet the needs of modeling complex data. Furthermore, almost all the existing
nonparametric Bayesian methods rely solely on crafting or learning (Welling et al., 2012) a
nonparametric Bayesian prior encoding some special structures,¹ which indirectly influence
the posterior distribution of interest via a trade-off with likelihood models through Bayes'
rule. Since it is the post-data posterior distributions, which capture the latent structures
to be learned, that are of ultimate interest, an arguably more direct way to learn a
desirable latent-variable model is to impose posterior regularization (i.e., regularization
on posterior distributions), as we will explore in this paper. Another reason for using
posterior regularization is that in some cases it is more natural and easier to incorporate
side domain knowledge or structures, such as the large-margin constraints (Jaakkola et al.,
1999; Zhu et al., 2009), constraints defined on a manifold structure (Huh and Fienberg, 2010)
or the general expectation constraints defined with various forms of side information (Mann
and McCallum, 2010), directly on posterior distributions rather than through priors.

Posterior regularization, usually through imposing constraints on the posterior
distributions of latent variables or via some information projection, has been widely studied
in learning a finite log-linear model from partially observed data (e.g., semi-supervised
learning and learning with side information, such as labeled features), including generalized
expectation (Mann and McCallum, 2010), posterior regularization (Ganchev et al., 2010),
and alternating projection (Bellare et al., 2009), all of which perform maximum likelihood
estimation (MLE) to learn a single set of model parameters by optimizing an objective
that is regularized by posterior constraints. Recent attempts toward learning a posterior
distribution of model parameters include "learning from measurements" (Liang et al., 2009),
maximum entropy discrimination (Jaakkola et al., 1999) and MedLDA (maximum entropy
discrimination latent Dirichlet allocation) (Zhu et al., 2009). But again, all these
methods are restricted to finite parametric models. To the best of our knowledge, very few
attempts have been made to impose posterior regularization on nonparametric Bayesian
latent variable models.

1. Although the likelihood function is another dimension that can be changed to incorporate
domain knowledge, existing work on Bayesian nonparametric methods has mainly focused on the
prior distributions. Following this convention, this paper assumes that a common likelihood
model (e.g., Gaussian likelihood for continuous data) is given.

Technically, although it is intuitively natural for MLE-based methods (i.e., maximiz- 
ing a likelihood-based objective function with hidden variables) to include a regularization 
term on the posterior distributions of latent variables when performing an EM-like proce- 
dure, this is not straightforward for Bayesian inference using the classic Bayes' rule because 
we do not have an optimization objective to be regularized. Although Bayesian inference
with hard posterior constraints can be heuristically implemented, e.g., using rejection
sampling (Bishop, 2006), it could be extremely inefficient when the sample space is high
dimensional. Things get even worse when the posterior constraints are soft, i.e., some
violations are allowed but the degree of violation is unknown. Soft constraints could lead
to uncountably many feasible subspaces (each with a different complexity or penalty),
which makes a rejection sampling method generally infeasible (see Figure 1(b) for an
illustration).

To offer a mathematically rigorous computational framework for dealing with both hard 
and soft posterior constraints, in this paper we present a general formulation of regularized 
Bayesian inference (RegBayes), which offers an extra dimension of freedom to standard 
Bayesian inference by imposing appropriate regularization on the post-data posterior
distributions. We base our work on the fresh information-theoretical interpretation of
Bayes' theorem by Zellner (1988), namely, that Bayes' theorem can be reformulated as a
KL-divergence minimization problem. Under this optimization framework, we incorporate
posterior constraints to do regularized Bayesian inference, with a penalty term that
measures the violation of the constraints. RegBayes covers the broad spectrum of graphical
models, including both directed Bayesian networks and undirected Markov networks. For
undirected models, the resulting model is a hybrid chain graph (Frydenberg, 1990) when
performing Bayesian inference (Murray and Ghahramani, 2004; Qi et al., 2005; Welling and
Parise, 2006), which is usually much more challenging than Bayesian inference in directed
Bayesian networks. When the convex regularization is induced from a linear operator (e.g.,
expectation) of the posterior distributions, RegBayes can be solved with convex analysis
theory.

By allowing constraints to be imposed directly on post-data posterior distributions, we
believe that the extra flexibility of RegBayes can be beneficial and stimulate new
developments in Bayesian nonparametrics and Bayesian inference in general. In this paper, we
particularly concentrate on illustrating how to use the ideas of RegBayes to push forward
the interface between Bayesian nonparametrics and large-margin learning, which have
complementary advantages but have been largely treated as two isolated subfields in the
community. As the core idea of support vector machines (Vapnik, 1995) and maximum entropy
discrimination (Jaakkola et al., 1999), as well as their structured extensions of max-margin
Markov networks (Taskar et al., 2003) and maximum entropy discrimination Markov networks
(Zhu and Xing, 2009), large-margin learning has shown great success in many scenarios. But a
large-margin model rarely has the flexibility of nonparametric Bayesian models to
automatically resolve model complexity from empirical data, especially when latent variables
are present (...). Specifically, we develop the infinite latent support vector machines
(iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the
discriminative large-margin idea to learn infinite latent feature models for classification
and multi-task learning (Argyriou et al., 2007; Bakker and Heskes, 2003), respectively. Both
iLSVM and MT-iLSVM are special cases of RegBayes that explore the large-margin principle to
consider supervised information for learning predictive latent features, which are good for
classification or multi-task learning. For iLSVM, we use the IBP prior to allow the model to
have an unbounded number of latent features a priori. For MT-iLSVM, we use a similar IBP
prior to infer a latent projection matrix that captures the correlations among multiple
predictive tasks while avoiding pre-specifying the dimensionality of the projection matrix.
The regularized inference problems can be efficiently solved with an iterative procedure,
which leverages existing high-performance convex optimization techniques. As a by-product,
we also show that MedLDA (Zhu et al., 2009) is a RegBayes model, but with a finite number of
latent features.

The rest of the paper is structured as follows. Section 2 discusses related work. Section 
3 presents the general framework of regularized Bayesian inference (RegBayes), together 
with the convex duality results that will be needed in later sections. Section 4 concretizes
the ideas of RegBayes and presents two infinite latent feature models with large-margin con- 
straints for both classification and multi-task learning. Section 5 presents some preliminary 
experimental results. Finally, Section 6 concludes and discusses future research directions. 



2. Related Work 



Bayesian inference is one of the most successful paradigms to model uncertainty of empirical
data arising in scientific and engineering domains. Bishop (2006) discusses many popular
examples in his seminal book, but the book mainly focuses on finite parametric models.
Recently, nonparametric Bayesian inference has attracted much attention in statistics and
machine learning, and many proposals have been made towards developing a full Bayesian
treatment of much richer forms of objects, such as sequential data, grouped data, data with
a tree structure and relational data. Gershman and Blei (2011) present a nice tutorial on
this subject.



Expectation regularization or expectation constraints have also been considered to
regularize model parameter estimation in the context of semi-supervised learning or learning
with weakly labeled data. Mann and McCallum (2010) summarize the recent developments of the
generalized expectation (GE) criteria for training a discriminative probabilistic model
(e.g., maximum entropy models or conditional random fields (Lafferty et al., 2001)) with
unlabeled data. By providing appropriate side information, such as labeled features or
estimates of label distributions, a GE-based penalty function is defined to regularize the
model distribution, e.g., the distribution of class labels. One commonly used GE function is
the KL-divergence between empirical expectation and model expectation of some feature
functions. Although the GE criteria can be used alone as a scoring function to estimate the
unknown parameters of a discriminative model, they are more usually used as a regularization
term added to an estimation method, such as maximum (conditional) likelihood estimation.
Bellare et al. (2009) presented a different formulation of using expectation constraints in
semi-supervised learning by introducing an auxiliary distribution to GE, together with an
alternating projection algorithm, which can be more efficient. Liang et al. (2009) proposed
to use the general notion of "measurements" to encapsulate the variety of weakly labeled
data for learning an exponential family model. The measurements can be labels, partial
labels or other constraints on model predictions. Under the EM framework, posterior
constraints are used in (Graça et al., 2009) to modify the E-step of an EM algorithm to
project the model posterior distributions onto the subspace of distributions that satisfy a
set of auxiliary constraints.

Dudík et al. (2007) study the generalized maximum entropy principle with a rich form of
expectation constraints using convex duality theory, where the standard moment matching
constraints of maximum entropy are relaxed to inequality constraints. But their analysis was
restricted to KL-divergence minimization (the maximum entropy principle is a special case)
and the finite dimensional space of observations. Later on, Altun and Smola (2006) presented
a more general duality theory for a family of divergence functions on Banach spaces. We have
drawn a lot of inspiration from both papers to develop the regularized Bayesian inference
framework using convex duality theory.

Regularized Bayesian inference provides a computational framework for developing
nonparametric Bayesian models with appropriate posterior constraints. The present paper
provides a full extension of our preliminary work (Zhu et al., 2011b,a). For example, the
infinite SVM (iSVM) (Zhu et al., 2011b) is a latent class model, where each data example is
assigned to a single mixture component (i.e., a 1-dimensional space), and both iLSVM and
MT-iLSVM extend the ideas to infinite latent feature models. For multi-task learning,
nonparametric Bayesian models have been developed in (Xue et al., 2007; Rai and Daume III,
2010) for learning features shared by multiple tasks. However, these methods are based
on standard Bayesian inference, without the ability to consider posterior regularization,
such as the large-margin constraints or the manifold constraints (Huh and Fienberg, 2010).
Finally, MT-iLSVM is a nonparametric Bayesian formulation of the popular multi-task
learning methods (Ando and Zhang, 2005; Jebara, 2011).



3. Regularized Bayesian Inference 

In this section, we present the computational framework of regularized Bayesian inference.
We begin with a brief review of the basic results due to Zellner (1988).



3.1 Bayesian Inference as a Learning Model 

Let $\mathcal{M}$ be a space containing all the random variables of a physical generative
process whose posterior distributions we are trying to infer from empirical data. Let us
first consider the case of full Bayesian inference, where $\mathcal{M}$ is also the model
space and each element $M \in \mathcal{M}$ represents a model. We will discuss the setting of
empirical Bayesian inference shortly, where the model has some unknown model parameters. At
the core of Bayesian inference is Bayes' theorem, which offers a computational procedure to
combine prior knowledge and empirical data. More formally, Bayesian inference starts with a
prior distribution $\pi(M)$ and a likelihood function $p(x|M)$ indexed by the model
$M \in \mathcal{M}$. Then, given a collection of observed data
$\mathcal{D} = \{x_1, \dots, x_N\}$, the posterior distribution is

$$p(M|\mathcal{D}) = \frac{\pi(M)\, p(\mathcal{D}|M)}{p(\mathcal{D})} = \frac{\pi(M)\, p(x_1, \dots, x_N|M)}{p(x_1, \dots, x_N)}, \tag{1}$$

where $p(\mathcal{D})$ is the marginal likelihood or evidence of the observed data. Under the
criterion of an optimal information processing rule, Zellner (1988) first showed that Bayes'
rule is optimal and 100% efficient; and the posterior distribution due to Bayes' theorem is
the same as the optimum solution of the convex variational problem

$$\begin{aligned}
\min_{p(M)} \quad & \mathrm{KL}(p(M)\,\|\,\pi(M)) - \int_{\mathcal{M}} \log p(\mathcal{D}|M)\, p(M)\, dM \\
\mathrm{s.t.:} \quad & p(M) \in \mathcal{P}_{\mathrm{prob}},
\end{aligned} \tag{2}$$

where $\mathrm{KL}(p(M)\,\|\,\pi(M))$ is the Kullback-Leibler (KL) divergence, and
$\mathcal{P}_{\mathrm{prob}}$ is the space of valid probability distributions with an
appropriate dimension. The constraint is due to the law of conservation of belief. Zellner
called $p(M)$ a post-data distribution (a pdf for continuous variables) in order to
distinguish it from the posterior distribution by Bayes' theorem. Given the equivalence, we
will call $p(M)$ a posterior distribution in the sequel if no confusion arises.
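
The equivalence between Bayes' rule and problem (2) can be checked numerically on a toy discrete model space. The following is a minimal sketch, assuming a hypothetical three-model space with made-up prior and likelihood values (none of these numbers come from the paper): it evaluates the objective of problem (2) at the Bayes posterior and compares it with a grid scan over the probability simplex and with $-\log p(\mathcal{D})$.

```python
import math

def zellner_objective(p, prior, lik):
    """KL(p || prior) - E_p[log lik] on a discrete model space (objective of problem (2))."""
    total = 0.0
    for p_m, pi_m, l_m in zip(p, prior, lik):
        if p_m > 0.0:  # the term 0 * log 0 is taken to be 0
            total += p_m * (math.log(p_m / pi_m) - math.log(l_m))
    return total

# Hypothetical numbers, chosen only for illustration.
prior = [0.5, 0.3, 0.2]   # pi(M) over three candidate models
lik = [0.2, 0.5, 0.3]     # p(D | M) for each model

# Bayes' rule.
evidence = sum(pi * l for pi, l in zip(prior, lik))
posterior = [pi * l / evidence for pi, l in zip(prior, lik)]
best = zellner_objective(posterior, prior, lik)

# Scan a regular grid over the simplex; no grid point should beat the posterior.
steps = 50
grid_min = min(
    zellner_objective([i / steps, j / steps, (steps - i - j) / steps], prior, lik)
    for i in range(steps + 1)
    for j in range(steps + 1 - i)
)

print("objective at Bayes posterior:", best)
print("-log evidence:               ", -math.log(evidence))
print("best value on the grid:      ", grid_min)
```

In this sketch the objective at the posterior coincides with the negative log evidence, which is the minimum value of problem (2); the grid scan is only a sanity check, not a proof.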

As commented by E. T. Jaynes (Zellner, 1988), "this fresh interpretation of Bayes' theorem
could make the use of Bayesian methods more attractive and widespread, and stimulate
new developments in the general theory of inference". Below, we study how to extend the
basic results to incorporate posterior constraints in Bayesian inference.



3.2 Regularized Bayesian Inference with Expectation Constraints 

In standard Bayesian inference, although the constraint due to the law of conservation of
belief (i.e., $p(M) \in \mathcal{P}_{\mathrm{prob}}$) does not consider domain knowledge or
structures, the above formulation offers one way to extend the scope of Bayesian inference.
Formally, we present regularized Bayesian inference (RegBayes) as a novel computational
procedure to combine prior knowledge and empirical data by solving the constrained
optimization problem

$$\begin{aligned}
\min_{p(M),\, \xi} \quad & \mathrm{KL}(p(M)\,\|\,\pi(M)) - \int_{\mathcal{M}} \log p(\mathcal{D}|M)\, p(M)\, dM + U(\xi) \\
\mathrm{s.t.:} \quad & p(M) \in \mathcal{P}_{\mathrm{post}}(\xi),
\end{aligned} \tag{3}$$

where $\mathcal{P}_{\mathrm{post}}(\xi)$ is a subspace of distributions that satisfy a set of
constraints besides the standard normalization constraints of a probability. To distinguish
the two cases, we will call a problem unconstrained if it only has the standard
normalization constraints or does not have any constraints at all.

Although different types of constraints could arise in practice, this paper focuses on
expectation constraints, each of which is a function of $p(M)$ through an expectation
operator. For instance, let $\psi_t$ be a feature function defined on $\mathcal{M}$; a
constraint can be of the form

$$h(Ep(\psi_t)) \le \xi_t, \tag{4}$$

where $E$ is the expectation operator, i.e., $Ep(\psi_t) = \mathbb{E}_{M \sim p}[\psi_t(M)]$.
The auxiliary parameters $\xi$ are usually nonnegative and interpreted as slack variables.
The constraints with non-trivial $\xi$ are soft constraints. But we emphasize that by
defining $U$ as an indicator function, the formulation (3) covers the case where hard
constraints are imposed. For instance, if we define, for fixed thresholds $\gamma_t$,

$$U(\xi) = \sum_{t} \mathbb{I}(\xi_t = \gamma_t),$$

where $\mathbb{I}(c)$ is an indicator function that equals 0 if the condition $c$ is
satisfied and $\infty$ otherwise, then all the expectation constraints are hard constraints.
As illustrated in Figure 1(a), hard constraints define one single feasible subspace (assumed
to be non-empty). In general, we assume that $U(\xi)$ is a convex function, which measures
the complexities of the feasible subspaces, as illustrated in Figure 1(b). A larger subspace
typically leads to a higher complexity. In the classification models to be presented, $U$
corresponds to a surrogate loss, e.g., the hinge loss of a prediction rule, as we shall see.

Figure 1: Illustration of the (a) hard and (b) soft constraints in the simple setting which
has only three possible models. For hard constraints, we have only one feasible
subspace. In contrast, we have many (normally infinite for continuous $\xi$) feasible
subspaces for soft constraints and each of them is associated with a different
complexity or penalty, measured by the $U$ function.

In fact, the constrained formulation of RegBayes can be equivalently written in an
"unconstrained" form

$$\min_{p(M) \in \mathcal{P}_{\mathrm{prob}}} \ \mathrm{KL}(p(M)\,\|\,\pi(M)) - \int_{\mathcal{M}} \log p(\mathcal{D}|M)\, p(M)\, dM + g(Ep(M)). \tag{5}$$

If we have $T$ features, then the linear operator $E$ (i.e., expectation) maps $p$ to a
point in $\mathbb{R}^T$. We assume that the real-valued function
$g : \mathbb{R}^T \to \mathbb{R}$ is convex and lower semi-continuous. For each $U$, we can
induce a $g$ function, and vice versa. If we use hard constraints, as in regularized maximum
entropy density estimation (Altun and Smola, 2006; Dudík et al., 2007), we will have

$$g(Ep) = \sum_{t} \mathbb{I}\big(h(Ep(\psi_t)) \le \gamma_t\big). \tag{6}$$



For the regularization function $g$, as well as $U$, we have many choices besides
the above-mentioned indicator function. For example, if we could obtain empirical
expectations of some feature functions from observed data, one natural regularization
function would be the KL-divergence between empirical expectations and the expectations
computed from the model distribution, i.e.,
$g(Ep) = \mathrm{KL}(\mathbb{E}_{\tilde{p}}[\psi_t]\,\|\,Ep(\psi_t))$ with $\tilde{p}$ the
empirical distribution, or the generalized Bregman divergence for unnormalized expectations.
This regularization function has been used in (Mann and McCallum, 2010) for label
regularization, in the context of semi-supervised learning. Other choices include equality
constraints, box constraints, and the $\ell_2^2$ penalty (see Table 1 in (Dudík et al.,
2007) for a summary). We will also present three new examples shortly for developing latent
support vector machines.



3.2.1 Generalization Beyond Bayesian Networks 

Standard Bayesian inference and the proposed RegBayes implicitly make the assumption
that the model can be graphically drawn as a Bayesian network as illustrated in
Figure 2(a).² Here, we consider a more general formulation which could cover both directed
and undirected latent variable models, such as the well-studied Boltzmann machines (Murray
and Ghahramani, 2004; Welling et al., 2004), as well as the case where a model could have
some unknown parameters (e.g., hyper-parameters) and need an estimation procedure, such as
maximum likelihood estimation (MLE), besides posterior inference. The latter is also known
as empirical Bayesian methods, which are frequently employed by practitioners.

Figure 2: Illustration graphs for three different types of models that involve Bayesian
inference: (a) a Bayesian generative model; (b) a Bayesian generative model with
unknown parameters $\Theta$; and (c) a chain graph model.

Extension 1: Empirical Bayesian Inference with Unknown Parameters: As illustrated in
Figure 2(b), in some cases we need to perform empirical Bayesian inference in the presence
of unknown parameters. For instance, in a linear-Gaussian Bayesian model, we may choose to
estimate its covariance matrix using MLE; and in a latent Dirichlet allocation (LDA) (Blei
et al., 2003) model, we may choose to estimate the unknown topical dictionary, although in
principle we can treat these parameters as random variables and perform full Bayesian
inference. In such cases, we need some mechanism to estimate the unknown parameters when
doing Bayesian inference. Let $\Theta$ be model parameters. We can formulate empirical
Bayesian inference as solving³

$$\begin{aligned}
\min_{\Theta,\, p(M|\Theta)} \quad & \mathrm{KL}(p(M|\Theta)\,\|\,\pi(M)) - \int_{\mathcal{M}} \log p(\mathcal{D}|M, \Theta)\, p(M|\Theta)\, dM \\
\mathrm{s.t.:} \quad & p(M|\Theta) \in \mathcal{P}_{\mathrm{prob}}(\Theta).
\end{aligned} \tag{7}$$

Although the problem is convex over $p(M|\Theta)$ for any fixed $\Theta$, it is not jointly
convex in general. A natural algorithm to solve this problem is the well-known EM procedure
(Dempster et al., 1977), which converges to a local optimum. Specifically, we have the
following result.

2. The structure within $\mathcal{M}$ can be arbitrary, either a directed, undirected or
hybrid chain graph.

3. The objective can be derived using variational techniques. It is in fact a variational
upper bound of the negative log-likelihood.



Lemma 1 For problem (7), the optimum solution of $p(M|\Theta)$ is equivalent to the
posterior distribution by Bayes' theorem for any $\Theta$; and the optimum $\Theta^*$ is the
MLE

$$\Theta^* = \arg\max_{\Theta} \log p(\mathcal{D}|\Theta).$$

Proof [Sketch] For any $\Theta$, by the calculus of variations and Lagrangian methods with a
Lagrange multiplier $\zeta$, we get that the optimum solution of $p(M|\Theta)$ is
$p^*(M|\Theta) = \frac{1}{\exp(1+\zeta)}\, \pi(M)\, p(\mathcal{D}|M, \Theta)$. Due to the
normalization constraint, we have $\exp(1+\zeta) = p(\mathcal{D}|\Theta)$. Thus,
$p^*(M|\Theta)$ is the posterior distribution inferred via Bayes' theorem. Substituting
$p^*(M|\Theta)$ into the objective of problem (7), we prove the second half. ■
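
The alternating scheme behind Lemma 1 can be illustrated on a tiny model. The following is a minimal sketch under assumptions that are not in the paper: the model space is a hypothetical set of three candidate means, the likelihood is i.i.d. Gaussian, and the unknown parameter $\Theta$ is a shared variance. The step for $p(M|\Theta)$ uses the Bayes posterior (the first half of Lemma 1), and the step for $\Theta$ is the closed-form minimizer of the objective of problem (7) for this particular likelihood.

```python
import math

def log_lik(data, mu, var):
    """log p(D | M, Theta) for i.i.d. Gaussian data with mean mu (the model) and variance var."""
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * len(data) * math.log(2.0 * math.pi * var) - 0.5 * sq / var

def log_evidence(data, prior, means, var):
    """log p(D | Theta) = log sum_M pi(M) p(D | M, Theta), computed with log-sum-exp."""
    terms = [math.log(pi) + log_lik(data, mu, var) for pi, mu in zip(prior, means)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def e_step(data, prior, means, var):
    """Optimal p(M | Theta) for fixed Theta: the Bayes posterior over models."""
    log_z = log_evidence(data, prior, means, var)
    return [math.exp(math.log(pi) + log_lik(data, mu, var) - log_z)
            for pi, mu in zip(prior, means)]

def m_step(data, post, means):
    """Optimal variance for fixed p(M | Theta): posterior-weighted mean squared error."""
    n = len(data)
    return sum(p * sum((x - mu) ** 2 for x in data) / n for p, mu in zip(post, means))

def objective(data, prior, means, post, var):
    """KL(p || prior) - E_p[log p(D | M, Theta)], the objective of problem (7)."""
    return sum(p * (math.log(p / pi) - log_lik(data, mu, var))
               for p, pi, mu in zip(post, prior, means) if p > 0.0)

# Hypothetical data and model space, for illustration only.
data = [0.9, 1.4, 0.3, 1.8, 1.1]
means = [0.0, 1.0, 2.0]
prior = [0.4, 0.3, 0.3]

var = 1.0
history = []
for _ in range(30):
    post = e_step(data, prior, means, var)
    history.append(objective(data, prior, means, post, var))
    var = m_step(data, post, means)

post = e_step(data, prior, means, var)
final = objective(data, prior, means, post, var)
print("final variance:", var)
print("final objective:", final, " -log p(D|Theta):", -log_evidence(data, prior, means, var))
```

After each posterior step the objective equals $-\log p(\mathcal{D}|\Theta)$ at the current $\Theta$, so the recorded values are non-increasing; as the text notes, such a procedure is only guaranteed to reach a local optimum in general.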



Extension 2: Chain Graph: In the above cases, we have assumed that the observed
data are generated by some model in a directed causal sense. This assumption holds in
directed latent variable models. However, in many cases, we may choose alternative
formulations to define the joint distribution of a model and the observed data. Figure 2(c)
illustrates one such scenario, where the model $M$ consists of two subsets of random
variables. One subset $H$ is connected to the observed data via an undirected graph and the
other subset $Z$ is connected to the observed data and $H$ using directed edges. This graph
is known as a chain graph. Due to the Markov properties of chain graphs (Frydenberg, 1990),
we know that the joint distribution has the factorization form

$$p(M, \mathcal{D}) = p(Z)\, p(H, \mathcal{D}|Z), \tag{8}$$

where $p(H, \mathcal{D}|Z)$ is a Markov random field (MRF). One concrete example of such a
hybrid chain model is the Bayesian Boltzmann machine (Murray and Ghahramani, 2004), which
treats the parameters of a Boltzmann machine as random variables and performs Bayesian
inference with MCMC sampling methods.
The insight that RegBayes covers undirected or chain graph latent variable models
comes from the observation that the objective $\mathcal{L}_B(p(M))$ of problem (2) is in
fact a KL-divergence, namely, we can show that

$$\mathcal{L}_B(p(M)) = \mathrm{KL}(p(M)\,\|\,p(M, \mathcal{D})), \tag{9}$$

where $p(M, \mathcal{D})$ is the joint distribution. For directed Bayesian networks (Zhu et
al., 2011a), we naturally have $p(M, \mathcal{D}) = \pi(M)\, p(\mathcal{D}|M)$. For the
undirected MRF models, we have $M = \{Z, H\}$ and again we can define the joint distribution
as in Eq. (8).
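
For the directed case, the identity in Eq. (9) follows by merging the two terms of the objective of problem (2) into a single logarithm. The short derivation below is a sketch of that step, using only the definitions already given; note that $p(M, \mathcal{D})$ is not normalized over $M$, so the KL-divergence here is understood in the generalized sense.

```latex
\begin{align*}
\mathcal{L}_B(p(M))
  &= \mathrm{KL}(p(M)\,\|\,\pi(M)) - \int_{\mathcal{M}} \log p(\mathcal{D}|M)\, p(M)\, dM \\
  &= \int_{\mathcal{M}} p(M) \log \frac{p(M)}{\pi(M)}\, dM
     - \int_{\mathcal{M}} p(M) \log p(\mathcal{D}|M)\, dM \\
  &= \int_{\mathcal{M}} p(M) \log \frac{p(M)}{\pi(M)\, p(\mathcal{D}|M)}\, dM
   = \mathrm{KL}(p(M)\,\|\,p(M, \mathcal{D})).
\end{align*}
```

Writing $p(M, \mathcal{D}) = p(M|\mathcal{D})\, p(\mathcal{D})$ gives $\mathcal{L}_B(p(M)) = \mathrm{KL}(p(M)\,\|\,p(M|\mathcal{D})) - \log p(\mathcal{D})$, which is minimized exactly at the Bayes posterior with minimum value $-\log p(\mathcal{D})$.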

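Putting the above two extensions of Bayesian inference together, regularized Bayesian
inference with estimation of unknown model parameters can be generally formulated as

$$\begin{aligned}
\min_{\Theta,\, p(M|\Theta),\, \xi} \quad & \mathcal{L}_B(\Theta, p(M|\Theta)) + U(\xi) \\
\mathrm{s.t.:} \quad & p(M|\Theta) \in \mathcal{P}_{\mathrm{post}}(\Theta, \xi)
\end{aligned}
\qquad \text{or} \qquad
\begin{aligned}
\min_{\Theta,\, p(M|\Theta)} \quad & \mathcal{L}_B(\Theta, p(M|\Theta)) + g(Ep(M)) \\
\mathrm{s.t.:} \quad & p(M|\Theta) \in \mathcal{P}_{\mathrm{prob}}(\Theta),
\end{aligned} \tag{10}$$

where $\mathcal{L}_B(\Theta, p(M|\Theta))$ is the objective function of problem (7). These
two formulations are equivalent. We will call the former a constrained formulation and call
the latter an unconstrained formulation by ignoring the standard normalization constraints,
which are easy to deal with.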



3.2.2 Optimization with Convex Duality Theory 

Depending on several factors, including the data likelihood model, the prior and the
regularization function, a RegBayes problem in general is highly non-trivial to solve,
either in the constrained or unconstrained form. Furthermore, as we have discussed, if
$\Theta$ is non-empty, the problem is not jointly convex, and we need to resort to an
iterative procedure. For example, an EM procedure to solve the unconstrained form could
iteratively solve for $p(M|\Theta)$ with $\Theta$ fixed, and solve for $\Theta$ with
$p(M|\Theta)$ given. The second step can be solved with numerical methods, such as gradient
descent. Below, we focus on solving the first step, which is also the whole RegBayes problem
for a full Bayesian model (i.e., $\Theta$ is empty). We know that the first step is convex
if $U$ and $g$ are convex. In this section, we present some results of convex analysis
theory to deal with the convex RegBayes problem (5) with expectation regularization or the
first step of the iterative procedure that solves problem (10).

To make the following statements general, we consider the following problem

$$\min_{x \in X} \ f(x) + g(Ax), \tag{11}$$

where $f : X \to \mathbb{R}$ is a convex function; $A : X \to B$ is a bounded linear
operator; and $g : B \to \mathbb{R}$ is also convex. Here, we introduce the convex analysis
theory to study this problem by formulating the primal-dual space relations of convex
optimization problems in the general settings, where both $X$ and $B$ are Banach spaces.
One important result is the Fenchel duality theorem.

Definition 2 (Convex Conjugate) Let $X$ be a Banach space and $X^*$ be its dual space.
The convex conjugate or the Legendre-Fenchel transformation of a function
$f : X \to [-\infty, +\infty]$ is $f^* : X^* \to [-\infty, +\infty]$, where

$$f^*(x^*) = \sup_{x \in X} \{\langle x, x^* \rangle - f(x)\}. \tag{12}$$
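
To make the definition concrete, here is a minimal numerical sketch in the simplest setting $X = \mathbb{R}$, where the pairing $\langle x, x^* \rangle$ is an ordinary product. The function names and the grid are illustrative assumptions; the supremum is approximated by a maximum over grid points and compared with the known closed form $f^*(y) = y^2/2$ for $f(x) = x^2/2$.

```python
def conjugate_on_grid(f, y, xs):
    """Approximate f*(y) = sup_x { x * y - f(x) } by a maximum over the grid points xs."""
    return max(x * y - f(x) for x in xs)

def half_square(x):
    """f(x) = x^2 / 2, whose convex conjugate is known to be f*(y) = y^2 / 2."""
    return 0.5 * x * x

# Regular grid on [-5, 5]; the maximizer x = y lies on the grid for the test points below.
xs = [i / 1000.0 for i in range(-5000, 5001)]
ys = [-2.0, -0.5, 0.0, 1.0, 3.0]

approx = [conjugate_on_grid(half_square, y, xs) for y in ys]
exact = [0.5 * y * y for y in ys]

for y, a, e in zip(ys, approx, exact):
    print(f"y = {y:5.2f}   grid conjugate = {a:.6f}   closed form = {e:.6f}")
```

A grid maximum can only underestimate the true supremum, so this kind of check is useful for sanity-testing a hand-derived conjugate rather than for computing one in general.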
Theorem 3 (Fenchel Duality (Borwein and Zhu, 2005)) Let $X$ and $B$ be Banach spaces,
$f : X \to \mathbb{R} \cup \{+\infty\}$ and $g : B \to \mathbb{R} \cup \{+\infty\}$ be convex
functions and $A : X \to B$ be a bounded linear map. Define the primal and dual values
$t$, $d$ by the Fenchel problems

$$t = \inf_{x \in X} \{f(x) + g(Ax)\} \quad \text{and} \quad d = \sup_{x^* \in B^*} \{-f^*(A^* x^*) - g^*(-x^*)\}.$$

Then these values satisfy the weak duality inequality $t \ge d$. If $f$, $g$ and $A$ satisfy
either

$$0 \in \mathrm{core}(\mathrm{dom}\, g - A\, \mathrm{dom}\, f) \ \text{and both } f \text{ and } g \text{ are lower semicontinuous (lsc)}, \tag{13}$$

or

$$A\, \mathrm{dom}\, f \cap \mathrm{cont}\, g \ne \emptyset, \tag{14}$$

then $t = d$ and the supremum to the dual problem is attainable if finite.



The Fenchel duality theorem can be applied to solve divergence minimization problems
for density estimation (Altun and Smola, 2006; Dudík et al., 2007). Let
$\psi = (\psi_1, \dots, \psi_T)$ be a vector of feature functions
$\psi_t : X \to \mathbb{R}$ and let $A$ be the expectation operator of the feature functions
with respect to the distribution $p$ on $X$, that is, $Ap = \mathbb{E}_{x \sim p}[\psi(x)]$,
where $\psi(x) = (\psi_1(x), \dots, \psi_T(x))$. Given a set of observed data
$\mathcal{D} = \{x_d\}_{d=1}^{N}$, we let $\tilde{\psi}$ denote the observed empirical values
of the features, namely, $\tilde{\psi} = \frac{1}{N} \sum_{d=1}^{N} \psi(x_d)$. Then, when
the $f$ function is a KL-divergence and the constraints are relaxed moment matching
constraints, the following result can be proved.



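Lemma 4 (KL-divergence with Constraints (Altun and Smola, 2006))

$$\min_{p} \Big\{ \mathrm{KL}(p \| q) \;\; \text{s.t.:} \;\; \| \mathbb{E}_p[\psi] - \tilde{\psi} \| \le \epsilon \; \text{ and } \; p \in \mathcal{P}_{\mathrm{prob}} \Big\} \qquad (15)$$
$$= \max_{\phi} \Big\{ \langle \phi, \tilde{\psi} \rangle - \log \int q(x) \exp(\langle \phi, \psi(x) \rangle)\, dx - \epsilon \| \phi \|_{*} \Big\},$$

where $\| \cdot \|_{*}$ is the dual norm, the unique solution is given by
$\hat{p}_{\phi}(x) = q(x) \exp(\langle \phi, \psi(x) \rangle - \Lambda_{\phi})$, and $\phi$ is the solution of the dual problem.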



The problem in the above lemma has hard constraints, and the corresponding $g$ is
the indicator function $\mathbb{I}(\| \mathbb{E}_p[\psi] - \tilde{\psi} \| \le \epsilon)$ in order to apply the Fenchel duality theorem.
Many other examples of the posterior constraints can be found in (Dudík et al., 2007;
Mann and McCallum, 2010), as we have discussed in Section 3.2. In this paper, we consider
the general soft constraints as in the RegBayes problem (3). Furthermore, we do not
assume the existence of a fully observed dataset to compute the empirical expectation $\tilde{\psi}$.
Specifically, we have the following result.

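Lemma 5 (RegBayes) Let $E$ be the expectation operator with feature functions $\psi$ defined
on $\mathcal{M}$, and assume $g$ is convex. We have

$$\min_{p(\mathcal{M})} \Big\{ \mathrm{KL}(p(\mathcal{M}) \| p(\mathcal{M}, \mathcal{D})) + g(Ep) \;\; \text{s.t.:} \;\; p(\mathcal{M}) \in \mathcal{P}_{\mathrm{prob}} \Big\} \qquad (16)$$
$$= \max_{\phi} \Big\{ -\log \int_{\mathcal{M}} p(\mathcal{M}, \mathcal{D}) \exp(\langle \phi, \psi(\mathcal{M}) \rangle)\, d\mathcal{M} - g^*(-\phi) \Big\},$$

where the unique solution is given by $\hat{p}_{\phi}(\mathcal{M}) = p(\mathcal{M}, \mathcal{D}) \exp(\langle \phi, \psi(\mathcal{M}) \rangle - \Lambda_{\phi})$ and $\phi$ is the
solution of the dual problem.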
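
The duality in Lemma 5 can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not part of the original derivation: the model $\mathcal{M}$ takes four discrete values, the joint weights and the scalar feature values are made up, and the regularizer is $g(u) = C \max(0, u)$, whose conjugate restricts the dual variable to $-C \le \phi \le 0$. All function names are hypothetical.

```python
import math

def dual_value(phi, weights, feats):
    """Dual objective -log sum_i w_i exp(phi * psi_i); feasible for phi in [-C, 0]."""
    return -math.log(sum(w * math.exp(phi * s) for w, s in zip(weights, feats)))

def tilted(phi, weights, feats):
    """Exponentially tilted solution p_phi(i) proportional to w_i exp(phi * psi_i)."""
    unnorm = [w * math.exp(phi * s) for w, s in zip(weights, feats)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def primal_value(p, weights, feats, C):
    """KL(p || w) + C * max(0, E_p[psi]), where w is the unnormalized joint p(M, D)."""
    kl = sum(pi * math.log(pi / w) for pi, w in zip(p, weights) if pi > 0)
    mean = sum(pi * s for pi, s in zip(p, feats))
    return kl + C * max(0.0, mean)

def solve_dual(weights, feats, C, iters=200):
    """Maximize the concave dual over the interval [-C, 0] by ternary search."""
    lo, hi = -C, 0.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if dual_value(a, weights, feats) < dual_value(b, weights, feats):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

weights = [0.10, 0.05, 0.20, 0.15]  # assumed toy values of p(M, D)
feats = [-1.0, 0.5, 1.0, 2.0]       # assumed toy feature psi(M)
C = 0.5
phi = solve_dual(weights, feats, C)
p_hat = tilted(phi, weights, feats)
gap = primal_value(p_hat, weights, feats, C) - dual_value(phi, weights, feats)
print("dual optimum phi =", phi, "duality gap =", gap)
```

On this example the primal value at the tilted distribution and the dual optimum should agree up to numerical error, while any other distribution gives a primal value no smaller than the dual optimum (weak duality).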

Now, we derive the conjugate functions of three other important examples, which will be
used shortly for developing the infinite latent SVM models. We defer the proofs to the Appendix.
Specifically, the first one is the conjugate of a simple function, which will be used in a binary
latent SVM classification model.

Lemma 6 Let $g_0 : \mathbb{R} \to \mathbb{R}$ be defined as $g_0(x) = C \max(0, x)$. Then, we have
$g_0^*(\mu) = \mathbb{I}(0 \le \mu \le C)$, where $\mathbb{I}(\cdot)$ is the indicator function that equals 0 if its argument
holds and $+\infty$ otherwise.
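
The conjugate of $g_0(x) = C \max(0, x)$ is the indicator of the interval $[0, C]$: the supremum in Definition 2 is 0 when $0 \le \mu \le C$ and unbounded otherwise. A minimal numerical sketch of this fact follows; the grid radius and the function names are illustrative assumptions, and the unbounded case shows up as a value that grows linearly with the radius.

```python
def g0(x, C):
    """The function g0(x) = C * max(0, x)."""
    return C * max(0.0, x)

def conjugate_on_grid(mu, C, radius=100.0, steps=2001):
    """Approximate sup_x { mu * x - g0(x) } over a uniform grid on [-radius, radius]."""
    best = float("-inf")
    for i in range(steps):
        x = -radius + 2.0 * radius * i / (steps - 1)
        best = max(best, mu * x - g0(x, C))
    return best

C = 1.0
inside = conjugate_on_grid(0.3, C)    # 0 <= mu <= C: the supremum is attained at x = 0
above = conjugate_on_grid(1.5, C)     # mu > C: grows like (mu - C) * radius
below = conjugate_on_grid(-0.2, C)    # mu < 0: grows like -mu * radius
print(inside, above, below)
```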

The second function is slightly more complex, which will be used for defining a multi-way
latent SVM classification model. Specifically, let $G := \{x \in \mathbb{R}^L : \exists j, x_j \ge 0\}$. We
define the function $g_1 : G \to \mathbb{R}$ as

$$g_1(x) = C \max(x), \qquad (17)$$

where $\max(x) := \max(x_1, \dots, x_L)$. Clearly, $g_1$ is convex because it is a point-wise
maximum (Boyd and Vandenberghe, 2004) of the simple linear functions $\phi_j(x) = x_j$. Then,
we have the following results.

Lemma 7 The convex conjugate of $g_1(x)$ as defined above is

$$g_1^*(\mu) = \mathbb{I}\Big(\forall i, \mu_i \ge 0; \text{ and } \sum_j \mu_j \le C\Big).$$

Let $G' := \{x \in \mathbb{R}^2 : x_1 + x_2 = 0\}$ and let $(c_1, c_2)$ be constants. The last function that we
are interested in is $g_2 : G' \to \mathbb{R}$, where

$$g_2(x; c_1, c_2) = C(\max(0, x_1 - c_1) + \max(0, x_2 - c_2)). \qquad (18)$$

Then, we have the following lemma, which will be used in developing a large-margin regression
model.

Lemma 8 The convex conjugate of $g_2(x)$ as defined above is

$$g_2^*(\mu; c_1, c_2) = (c_1 \mu_1 + c_2 \mu_2) + \mathbb{I}(\forall i, 0 \le \mu_i \le C; \text{ and } \mu_1 \mu_2 = 0).$$

Note that although $g_2$ and $g_2^*$ are defined in two dimensional spaces, the feature values
of $x$ or $\mu$ are in fact lying in lower (i.e., 1) dimensional subspaces because of the constraints.

3.3 MedLDA: A RegBayes Model with Finite Latent Features 

Before we present the nonparametric regularized Bayesian models, which could have an
unbounded number of hidden units, we end this section with a new interpretation of the
previously proposed MedLDA (maximum entropy discrimination latent Dirichlet allocation)
model (Zhu et al., 2009) under the framework of regularized Bayesian inference. In
MedLDA, each data example is projected to a point in a finite dimensional latent space, of
which each feature corresponds to a topic, i.e., a unigram distribution over the terms in a
vocabulary. MedLDA represents each data example as a probability distribution over the features,
which results in a conservation constraint (i.e., the more a data example expresses on one feature,
the less it can express others) (Griffiths and Ghahramani, 2005). The infinite latent feature
models discussed later do not have such a constraint.

Without loss of generality, we consider the MedLDA regression model as an example
(the classification model is similar), whose graphical structure is shown in Figure 3, where we
have assumed all data examples have the same length $V$ for notation simplicity. Let $K$ be the
number of topics or the dimensionality of the latent topic space. Define $\bar{Z}_n := \frac{1}{V} \sum_{m=1}^{V} Z_{nm}$
and let $\Theta := \{\alpha, \beta, \delta^2\}$ denote the unknown model parameters and $\mathcal{D} = \{y_n, w_{nm}\}$ be the
training set. MedLDA was defined as solving a regularized MLE problem with expectation
constraints

$$\min_{\Theta, \xi, \xi^*} \; -\log p(\{y_n, w_{nm}\} | \Theta) + C \sum_{n=1}^{N} (\xi_n + \xi_n^*) \qquad (19)$$
$$\text{s.t. } \forall n: \;\; y_n - \mathbb{E}_p[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n, \quad -y_n + \mathbb{E}_p[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n^*, \quad \xi_n, \xi_n^* \ge 0.$$

The posterior constraints are imposed following the large-margin principle and they correspond
to a quality measure of the prediction results on training data. In fact, it is easy to
show that minimizing $U(\xi, \xi^*) = C \sum_{n=1}^{N} (\xi_n + \xi_n^*)$ under the above constraints is equivalent
to minimizing an $\epsilon$-insensitive loss (Smola and Schölkopf, 2003)

$$\mathcal{R}_{\epsilon}(p(\{\theta_n, z_{nm}, \eta\} | \mathcal{D}, \Theta)) = C \sum_{n=1}^{N} \max(0, |y_n - \mathbb{E}_p[\eta^\top \bar{Z}_n]| - \epsilon) \qquad (20)$$

of the expected linear prediction rule $\hat{y}_n = \mathbb{E}_p[\eta^\top \bar{Z}_n]$. Therefore, MedLDA can be seen
as belonging to the framework of the GE criteria (Mann and McCallum, 2010), but in the
context of large-margin learning.
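
The equivalence between the slack penalty in problem (19) and the $\epsilon$-insensitive loss in Eq. (20) can be illustrated with a few lines of code. This is a minimal sketch with made-up numbers: `y_hat` stands in for the expected predictions $\mathbb{E}_p[\eta^\top \bar{Z}_n]$, which are assumed to be given, and the function names are hypothetical.

```python
def eps_insensitive_loss(y, y_hat, eps, C):
    """C * sum_n max(0, |y_n - y_hat_n| - eps), the loss in Eq. (20)."""
    return C * sum(max(0.0, abs(t - p) - eps) for t, p in zip(y, y_hat))

def optimal_slacks(y, y_hat, eps):
    """Smallest slacks (xi_n, xi_n^*) that satisfy the constraints of problem (19)."""
    xi = [max(0.0, t - p - eps) for t, p in zip(y, y_hat)]
    xi_star = [max(0.0, p - t - eps) for t, p in zip(y, y_hat)]
    return xi, xi_star

y = [1.0, 2.0, 3.0]        # assumed toy responses
y_hat = [1.05, 2.5, 2.0]   # assumed expected predictions
eps, C = 0.1, 2.0
xi, xi_star = optimal_slacks(y, y_hat, eps)
penalty = C * sum(a + b for a, b in zip(xi, xi_star))
loss = eps_insensitive_loss(y, y_hat, eps, C)
print(penalty, loss)
```

For fixed predictions, the minimal slack penalty and the $\epsilon$-insensitive loss coincide, which is the equivalence used in the text.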

To practically learn a MedLDA model, since the above problem is intractable, variational
methods were used by introducing an auxiliary distribution $q(\{\theta_n, z_{nm}, \eta\} | \Theta)$ (footnote 4) to
approximate the true posterior $p(\{\theta_n, z_{nm}, \eta\} | \mathcal{D}, \Theta)$, replacing the negative data likelihood
with its upper bound $\mathcal{L}(q(\{\theta_n, z_{nm}, \eta\} | \Theta))$, and replacing $p$ by $q$ in the constraints. The
variational MedLDA regression model is

$$\min_{q, \Theta, \xi, \xi^*} \; \mathcal{L}(q(\{\theta_n, z_{nm}, \eta\} | \Theta)) + C \sum_{n=1}^{N} (\xi_n + \xi_n^*) \qquad (21)$$
$$\text{s.t. } \forall n: \;\; y_n - \mathbb{E}_q[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n, \quad -y_n + \mathbb{E}_q[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n^*, \quad \xi_n, \xi_n^* \ge 0,$$

where $\mathcal{L}(q(\{\theta_n, z_{nm}, \eta\} | \Theta)) = -\mathbb{E}_q[\log p(\{\theta_n, z_{nm}, \eta\}, \mathcal{D} | \Theta)] - \mathcal{H}(q(\{\theta_n, z_{nm}, \eta\} | \Theta))$ is a
variational upper-bound of the negative data log-likelihood. Note that the upper bound is
tight if no restricting constraints are made on the variational distribution $q$. In practice,
additional assumptions (e.g., mean-field) can be made on $q$ to derive a practical approximate
algorithm.

4. We have explicitly written the condition on model parameters.

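Based on the previous discussions on the extensions of RegBayes and the duality in
Lemma 5, we can reformulate the MedLDA regression model as an example of RegBayes.
Specifically, for the MedLDA regression model, we have $\mathcal{M} = \{\theta_n, z_{nm}, \eta\}$. According to
Eq. (9), we can easily show that

$$\mathcal{L}(q(\mathcal{M} | \Theta)) = \mathrm{KL}(q(\mathcal{M} | \Theta) \| p(\mathcal{M}, \mathcal{D} | \Theta)) = \mathcal{L}_B(\Theta, q(\mathcal{M} | \Theta)).$$

Then, the MedLDA problem is a RegBayes model in Eq. (10) with

$$\mathcal{P}_{\mathrm{post}}^{\mathrm{MedLDA}}(\Theta, \xi, \xi^*) := \Big\{ q(\{\theta_n, z_{nm}, \eta\} | \Theta) \; \Big| \; \forall n: \; y_n - \mathbb{E}_q[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n, \;\; -y_n + \mathbb{E}_q[\eta^\top \bar{Z}_n] \le \epsilon + \xi_n^* \Big\}. \qquad (22)$$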



For the MedLDA problem, we can use Lagrangian methods to solve the constrained
formulation. Alternatively, we can also use the convex duality theorem to solve the equivalent
unconstrained form. For the variational MedLDA, the $\epsilon$-insensitive loss is $\mathcal{R}_{\epsilon}(q(\{\theta_n, z_{nm}, \eta\} | \Theta))$.
Its conjugate can be derived using the results of Lemma 8. Specifically, we have the following
result, whose proof is deferred to Appendix A.4.

Lemma 9 (Conjugate of MedLDA) For the variational MedLDA problem, we have

$$\min_{\Theta, \; q(\{\theta_n, z_{nm}, \eta\} | \Theta) \in \mathcal{P}_{\mathrm{prob}}} \; \mathcal{L}(q(\{\theta_n, z_{nm}, \eta\} | \Theta), \Theta) + \mathcal{R}_{\epsilon}(q(\{\theta_n, z_{nm}, \eta\} | \Theta)) \qquad (23)$$
$$= \max_{\omega} \; -\log Z'(\omega, \Theta^*) - \sum_n g_2^*(\omega_n; -y_n + \epsilon, y_n + \epsilon),$$

where $\omega_n = (\omega_n, \omega_n^*)$. Moreover, the optimum distribution is the posterior distribution

$$q(\{\theta_n, z_{nm}, \eta\} | \Theta^*) = \frac{1}{Z'(\omega, \Theta^*)} \, p(\{\theta_n, z_{nm}, \eta\}, \mathcal{D} | \Theta^*) \exp\Big\{ \sum_n (\omega_n - \omega_n^*) \eta^\top \bar{Z}_n \Big\}, \qquad (24)$$

where $Z'(\omega, \Theta)$ is the normalization factor and the optimum parameters are

$$\Theta^* = \arg\max_{\Theta} \log p(\mathcal{D} | \Theta). \qquad (25)$$

Note that although in general, either the primal or the dual problem is hard to solve
exactly, the above conjugate results are still useful when developing approximate inference
algorithms. For instance, we can impose additional mean-field assumptions on $q$ in the
primal formulation and iteratively solve for each factor; and in this process convex conjugates
are useful to deal with the large-margin constraints (Zhu et al., 2009). Alternatively,
we can apply approximate methods (e.g., MCMC sampling) to infer $q$ based on its
solution in Eq. (24), and iteratively solve for the dual parameters $\omega$ using approximate
statistics (Schofield, 2006). We will discuss more on this when presenting the inference
algorithms for iLSVM and MT-iLSVM.

In the above discussions, we have treated the topics $\beta$ as fixed unknown parameters.
A fully Bayesian formulation would treat $\beta$ as random variables, e.g., with a Dirichlet
distribution prior as in (Blei et al., 2003; Griffiths and Steyvers, 2004). Under the RegBayes
interpretation, we can easily do such an extension of MedLDA, simply by moving $\beta$ from
$\Theta$ to $\mathcal{M}$.



4. Infinite Latent Support Vector Machines 

As we have stated above, MedLDA is a RegBayes model which has a finite number of latent 
features (i.e., topics) and the dimensionality is pre-specified. In this section, we present two 
nonparametric RegBayes models to illustrate how to develop latent large-margin classifiers 
and automatically resolve the unknown dimensionality of latent features from data. We 
consider two settings. For single-task classification, we consider learning latent features 
that can be used as a representation of examples to make prediction, and for multi-task 
learning, we consider learning a common latent projection matrix that captures relationships 
among the multiple tasks. 

We first present the single-task classification model. The basic setup is that we project
each data example $x \in \mathcal{X} \subset \mathbb{R}^D$ to a latent feature vector $z$. Here, we consider binary
features (footnote 5). Given a set of $N$ data examples, let $Z$ be the matrix, of which each row is a
binary vector $z_n$ associated with data sample $n$. Instead of pre-specifying a fixed dimension
of $z$, we resort to the nonparametric Bayesian methods and let $z$ have an infinite number
of dimensions. To make the expected number of active latent features finite, we put the
well-studied IBP prior on the binary feature matrix $Z$.



4.1 Indian Buffet Process 



The Indian buffet process (IBP) was proposed in (Griffiths and Ghahramani, 2005) and has
been successfully applied in various fields, such as link prediction (Miller et al., 2009)
and multi-task learning (Rai and Daumé III, 2010). We focus on its stick-breaking construction
(Teh et al., 2007), which is good for developing efficient inference methods. Let
$\pi_k \in (0, 1)$ be a parameter associated with each column of the binary matrix $Z$. Given $\pi_k$,
each $z_{nk}$ in column $k$ is sampled independently from $\mathrm{Bernoulli}(\pi_k)$. The parameters $\pi$ are
generated by a stick-breaking process

$$\pi_1 = \nu_1, \quad \text{and} \quad \pi_k = \nu_k \pi_{k-1} = \prod_{i=1}^{k} \nu_i, \qquad (26)$$

where $\nu_i \sim \mathrm{Beta}(\alpha, 1)$. This process results in a decreasing sequence of $\pi_k$. Specifically,
given a finite dataset, the probability of seeing feature $k$ decreases exponentially with $k$.

5. Real-valued features can be easily considered as in (Griffiths and Ghahramani, 2005).
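
The stick-breaking construction can be simulated directly. The following is a minimal sketch rather than the inference procedure used in the paper: it draws $\nu_k \sim \mathrm{Beta}(\alpha, 1)$, forms the running products $\pi_k$, and samples each $z_{nk}$ from $\mathrm{Bernoulli}(\pi_k)$. The finite `truncation` argument, the seed and the function name are assumptions made so that the infinite process can be run on a computer.

```python
import random

def sample_ibp_stick_breaking(alpha, num_rows, truncation, seed=0):
    """Draw (pi, Z) from a truncated stick-breaking representation of the IBP."""
    rng = random.Random(seed)
    pis = []
    stick = 1.0
    for _ in range(truncation):
        stick *= rng.betavariate(alpha, 1.0)  # nu_k ~ Beta(alpha, 1)
        pis.append(stick)                     # pi_k = nu_1 * ... * nu_k
    # Each entry z_nk is an independent Bernoulli(pi_k) draw.
    Z = [[1 if rng.random() < pi else 0 for pi in pis] for _ in range(num_rows)]
    return pis, Z

pis, Z = sample_ibp_stick_breaking(alpha=2.0, num_rows=5, truncation=10)
active = sum(1 for k in range(len(pis)) if any(row[k] for row in Z))
print("pi:", [round(p, 3) for p in pis])
print("number of active features:", active)
```

Because every $\nu_k$ lies in $(0, 1)$, the sampled $\pi_k$ are non-increasing in $k$, so later columns of $Z$ are active less and less often.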

4.2 Infinite Latent Support Vector Machines

We consider the multi-way classification, where each training data is provided with a categorical
label $y$, where $y \in \mathcal{Y} := \{1, \dots, L\}$. For binary classification and regression, a similar
procedure can be applied to impose large-margin constraints on posterior distributions.
Suppose that the latent features $z$ are given, then we can define the latent discriminant
function as linear

$$f(y, x, z; \eta) := \eta^\top g(y, x, z), \qquad (27)$$

where $g(y, x, z)$ is a vector stacking $L$ subvectors (footnote 6) of which the $y$th is $z^\top$ and all the others
are zero. Since we are doing Bayesian inference, we need to maintain the entire distribution
profile of the latent features $Z$. However, in order to make a prediction on the observed
data $x$, we need to remove the uncertainty of $Z$. Here, we define the effective discriminant
function as an expectation (footnote 7) (i.e., a weighted average considering all possible values of $Z$)
of the latent discriminant function. To make the model fully Bayesian, we also treat $\eta$
as random and aim to infer its posterior distribution from given data. More formally, the
effective discriminant function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is

$$f(y, x; p(Z, \eta)) := \mathbb{E}_{p(Z, \eta)}[f(y, x, z; \eta)] = \mathbb{E}_{p(Z, \eta)}[\eta^\top g(y, x, z)], \qquad (28)$$

where $p(Z, \eta)$ is the posterior distribution we want to infer.

6. We can consider the input features $x$ or its certain statistics in combination with the latent features $z$
to define a classifier boundary, by simply concatenating them in the subvectors.

7. Although other choices such as taking the mode are possible, our choice could lead to a computationally
easy problem because expectation is a linear functional of the distribution under which the expectation
is taken. Moreover, expectation can be more robust than taking the mode (Khan et al., 2010), and it
has been widely used in (Zhu et al., 2009, 2011b).

8. Hard constraints for the separable cases are covered by simply setting $\xi = 0$.

With the above definitions, we define the $\mathcal{P}_{\mathrm{post}}(\xi)$ in problem (3) using soft (footnote 8) large-margin
constraints as

$$\mathcal{P}_{\mathrm{post}}(\xi) := \Big\{ p(Z, \eta) \; \Big| \; \forall n \in \mathcal{I}_{\mathrm{tr}}: \; f(y_n, x_n; p(Z, \eta)) - f(y, x_n; p(Z, \eta)) \ge \ell_n^{\Delta}(y) - \xi_n, \; \forall y; \;\; \xi_n \ge 0 \Big\}$$

and define the penalty function as

$$U^c(\xi) := C \sum_{n \in \mathcal{I}_{\mathrm{tr}}} \xi_n^p,$$

where $p \ge 1$. If $p$ is 1, minimizing $U^c(\xi)$ is equivalent to minimizing the hinge-loss (or
$\ell_1$-loss) $\mathcal{R}_h^c$ of the prediction rule (35), where

$$\mathcal{R}_h^c(p(Z, \eta)) = C \sum_{n \in \mathcal{I}_{\mathrm{tr}}} \max_y \big( f(y, x_n; p(Z, \eta)) + \ell_n^{\Delta}(y) - f(y_n, x_n; p(Z, \eta)) \big);$$

if $p$ is 2, the surrogate loss is the $\ell_2$-loss. For clarity, we consider the hinge loss. The
non-negative cost function $\ell_n^{\Delta}(y)$ (e.g., 0/1-cost) measures the cost of predicting $x_n$ to be $y$
when its true label is $y_n$. $\mathcal{I}_{\mathrm{tr}}$ is the index set of training data and $I$ is an identity matrix
with appropriate dimensions.
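
The multi-way hinge loss above depends on the posterior only through the effective discriminant values. As a minimal sketch, the function below evaluates that loss from a table of scores; the scores are assumed to be precomputed, labels are indexed from 0 for convenience, and the default 0/1 cost and all names are illustrative choices.

```python
def multiclass_hinge(scores, labels, C, cost=None):
    """C * sum_n max_y ( f(y, x_n) + cost(y_n, y) - f(y_n, x_n) ).

    scores[n][y] is assumed to hold the effective discriminant value f(y, x_n; p(Z, eta)).
    """
    if cost is None:
        cost = lambda y_true, y: 0.0 if y == y_true else 1.0  # 0/1 cost
    total = 0.0
    for f_n, y_n in zip(scores, labels):
        total += max(f_n[y] + cost(y_n, y) - f_n[y_n] for y in range(len(f_n)))
    return C * total

scores = [[2.0, 0.5, 0.0],   # example 0: class 0 wins by a margin larger than 1
          [0.2, 0.1, 0.4]]   # example 1: the true class 1 has the lowest score
labels = [0, 1]
loss = multiclass_hinge(scores, labels, C=1.0)
print(loss)
```

The first example contributes nothing because its margin exceeds the cost, while the second contributes its largest margin violation. The loss is never negative because the term for $y = y_n$ is always zero.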

In order to robustly estimate the latent matrix $Z$, we need a reasonable amount of data.
Therefore, we also relate $Z$ to the observed data $x$ by defining a likelihood model to provide
as much data as possible. Here, we define the most common linear-Gaussian likelihood
model for real-valued data

$$p(x_n | z_n, W, \sigma_{n0}^2) = \mathcal{N}(x_n | W z_n^\top, \sigma_{n0}^2 I), \qquad (29)$$

where $W$ is a random loading matrix. We assume $W$ follows an independent Gaussian
prior, i.e., $\pi(W) = \prod_d \mathcal{N}(w_d | 0, \sigma_0^2 I)$. Figure 4(a) shows the graphical structure of iLSVM.
The hyperparameters $\sigma_0^2$ and $\sigma_{n0}^2$ can be set a priori or estimated from observed data (See
Appendix A.7 for details).

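Training: Putting the above definitions together, we get the RegBayes problem for
iLSVM in the following two equivalent forms

$$\min_{p(Z, \eta, W), \, \xi} \; \mathrm{KL}(p(Z, \eta, W) \| p(Z, \eta, W, \mathcal{D})) + U^c(\xi) \qquad (30)$$
$$\text{s.t.:} \;\; p(Z, \eta, W) \in \mathcal{P}_{\mathrm{post}}(\xi)$$
$$\Longleftrightarrow \; \min_{p(Z, \eta, W) \in \mathcal{P}_{\mathrm{prob}}} \; \mathrm{KL}(p(Z, \eta, W) \| p(Z, \eta, W, \mathcal{D})) + \mathcal{R}_h^c(p(Z, \eta)). \qquad (31)$$

Here, $p(Z, \eta, W) \in \mathcal{P}_{\mathrm{post}}$ means that the marginal distribution $p(Z, \eta)$ belongs to $\mathcal{P}_{\mathrm{post}}$.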

Note that in order to be a valid RegBayes model, we need to ensure that the objective
function and the posterior constraints have finite values. This can be verified as follows.
Although the number of latent features is allowed to be infinite, with probability one,
the number of non-zero features is finite when only a finite number of data are observed,
under the IBP prior. Moreover, because the KL-term in Eq. (3) has the
"zero forcing" property (Bishop, 2006, Chap. 10) and the prior distribution of feature $z_{nk}$
decreases exponentially as $k$ increases, we can expect that the posterior distribution of
feature $z_{nk}$ also decreases exponentially, when a finite set of data is observed. Thus, both
the objective function and the large-margin constraints are well-defined. Finally, to make
the problem computationally feasible, we usually set a finite upper bound $K$ to the number
of possible features, where $K$ is sufficiently large and known as the truncation level (See
Section 4.4 and Appendix A.7 for details). As shown in (Doshi-Velez et al., 2009), the
$\ell_1$-distance truncation error of marginal distributions decreases exponentially as $K$ increases.

Directly solving the iLSVM problems is not easy because either the posterior constraints
or the non-smooth regularization function $\mathcal{R}_h^c$ is hard to deal with. Thus, we resort to convex
duality theory, which will be useful for developing approximate inference algorithms, as
we have discussed in Section 3.3. We can either solve the constrained form (30) using
Lagrangian duality theory (Ito and Kunisch, 2008) or solve the unconstrained form (31)
using Fenchel duality theory. Here, we take the second approach. In this case, the linear
operator is the expectation operator, denoted by $E : \mathcal{P}_{\mathrm{prob}} \to \mathbb{R}^{|\mathcal{I}_{\mathrm{tr}}| \times L}$, and the element of
$Ep$ evaluated at $y$ for the $n$th example is

$$Ep(n, y) := f(y_n, x_n; p(Z, \eta)) - f(y, x_n; p(Z, \eta)) = \mathbb{E}_{p(Z, \eta)}[\eta^\top \Delta g_n(y, Z)], \qquad (32)$$

where $\Delta g_n(y, Z) := g(y_n, x_n, z_n) - g(y, x_n, z_n)$. We can easily prove that $\forall n$, $\max_y (\ell_n^{\Delta}(y) - Ep(n, y)) \ge 0$,
since the term for $y = y_n$ is zero. Then, let $g_1 : \mathbb{R}^L \to \mathbb{R}$ be a function defined in the same form as in Eq. (17).
We have

$$\mathcal{R}_h^c(p(Z, \eta)) = \sum_n g_1(\ell_n^{\Delta} - Ep(n)),$$

where $Ep(n) := (Ep(n, 1), \dots, Ep(n, L))$ and $\ell_n^{\Delta} := (\ell_n^{\Delta}(1), \dots, \ell_n^{\Delta}(L))$ are the vectors
of elements evaluated for the $n$th data. By Fenchel's duality theorem and the results in
Lemma 7, we can derive the conjugate of the problem (31). The proof is deferred to
Appendix A.5.

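Lemma 10 (Conjugate of iLSVM) For the iLSVM problem, we have that

$$\min_{p(Z, \eta, W) \in \mathcal{P}_{\mathrm{prob}}} \; \mathrm{KL}(p(Z, \eta, W) \| p(Z, \eta, W, \mathcal{D})) + \mathcal{R}_h^c(p(Z, \eta)) \qquad (33)$$
$$= \max_{\omega} \; -\log Z(\omega) + \sum_n \sum_y \omega_n^y \ell_n^{\Delta}(y) - \sum_n g_1^*(\omega_n),$$

where $\omega_n = (\omega_n^1, \dots, \omega_n^L)$ is the subvector associated with data $n$. Moreover, the optimum
distribution is the posterior distribution

$$p(Z, \eta, W) = \frac{1}{Z(\omega)} \, p(Z, \eta, W, \mathcal{D}) \exp\Big\{ \sum_n \sum_y \omega_n^y \eta^\top \Delta g_n(y, Z) \Big\}, \qquad (34)$$

where $Z(\omega)$ is the normalization factor and $\omega$ is the solution of the dual problem.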

Testing: to make predictions on test examples, we put both training and test data
together to do the regularized Bayesian inference. For training data, we impose the above
large-margin constraints because of the awareness of their true labels, while for test data,
we do the inference without the large-margin constraints since we do not know their true
labels. After inference, we make the prediction via the rule

$$y^* := \arg\max_y f(y, x; p(Z, \eta)). \qquad (35)$$

The ability to generalize to test data relies on the fact that all the data examples share
$\eta$ and the IBP prior. We can also cast the problem as a transductive inference problem
by imposing additional constraints on test data (Joachims, 1999). However, the resulting
problem will be generally harder to solve.
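
The prediction rule (35) becomes a simple computation once the posterior is summarized by its means. The sketch below assumes, as an illustration only, that the posterior factorizes as $p(Z, \eta) = p(Z) p(\eta)$, so that $\mathbb{E}[\eta^\top g(y, x, z)] = \mathbb{E}[\eta_y]^\top \mathbb{E}[z]$, where $\eta_y$ is the subvector of $\eta$ for class $y$. The function names, the 0-based class indices and the numbers are hypothetical.

```python
def effective_scores(eta_mean, z_mean):
    """Return f(y, x; p) for every class y under a factorized posterior.

    eta_mean[y][k] is assumed to hold E[eta_{y,k}] and z_mean[k] to hold E[z_k]
    for the example being classified.
    """
    return [sum(e * z for e, z in zip(eta_y, z_mean)) for eta_y in eta_mean]

def predict(eta_mean, z_mean):
    """Prediction rule (35): the class with the largest effective discriminant value."""
    scores = effective_scores(eta_mean, z_mean)
    return max(range(len(scores)), key=scores.__getitem__)

eta_mean = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # assumed posterior means, 3 classes, 2 features
z_mean = [0.9, 0.2]                              # assumed posterior feature probabilities
scores = effective_scores(eta_mean, z_mean)
label = predict(eta_mean, z_mean)
print(scores, label)
```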



Figure 4: Graphical structures of (a) infinite latent SVM (iLSVM); and (b) multi-task in- 
finite latent SVM (MT-iLSVM). For MT-iLSVM, the dashed nodes (i.e., w) 
illustrate the task relatedness but do not exist. 



4.3 Multi-Task Infinite Latent Support Vector Machines 

Different from classification, which is typically formulated as a single learning task, multi-task
learning aims to improve a set of related tasks through sharing statistical strength
between these tasks, which are performed jointly. Many different approaches have been
developed for multi-task learning (See (Jebara, 2011) for a review). In particular, learning
a common latent representation shared by all the related tasks has proven to be an
effective way to capture task relationships (Ando and Zhang, 2005; Argyriou et al., 2007;
Rai and Daumé III, 2010). Below, we present the multi-task infinite latent SVM (MT-iLSVM)
for learning a common binary projection matrix $Z$ to capture the relationships
among multiple tasks. As in iLSVM, we also put the IBP prior on $Z$ to allow it to
have an unbounded number of columns.

Suppose we have $M$ related tasks. Let $\mathcal{D}_m = \{(x_{mn}, y_{mn})\}_{n \in \mathcal{I}_{\mathrm{tr}}^m}$ be the training data
for task $m$. We consider binary classification tasks, where $\mathcal{Y}_m = \{+1, -1\}$. Extension to
multi-way classification or regression can be easily done. Figure 5(a) shows the naive way
that performs multiple tasks independently. In order to make the multiple tasks coupled
and share statistical strength, MT-iLSVM introduces a latent projection matrix $Z$. If the
latent matrix $Z$ is given, we define the latent discriminant function for task $m$ as

$$f_m(x_{mn}, Z; \eta_m) := (Z \eta_m)^\top x_{mn} = \eta_m^\top (Z^\top x_{mn}), \qquad (36)$$

where $x_{mn}$ is one data example in $\mathcal{D}_m$. This definition provides two views of how the $M$
tasks get related.

(1) If we let $\varsigma_m = Z \eta_m$, then $\varsigma_m$ is the actual parameter of task $m$ and all $\varsigma_m$ in different
tasks are coupled by sharing the same latent matrix $Z$, as illustrated in Figure 5(b);

(2) Another view is that each task $m$ has its own parameters $\eta_m$, but all the tasks share the
same latent projection matrix $Z$ to extract latent features $Z^\top x_{mn}$, which is a projection
of the input features $x_{mn}$, as illustrated in Figure 5(c).

Figure 5: Illustration of (a) multiple single task learning, where each task $m$ (represented
by the model $\eta_m$) is performed independently; (b) related multiple tasks in MT-iLSVM
with the first type of representation, where all the $M$ models need to
pass a common transformation (denoted by the matrix $Z$) in order to act on
input data; and (c) related multiple tasks in MT-iLSVM with the second type of
representation, where input data are projected into latent representations using
the same projection matrix $Z$.



As such, our method can be viewed as a nonparametric Bayesian treatment of alternating
structure optimization (ASO) (Ando and Zhang, 2005), which learns a single projection
matrix with a pre-specified latent dimension. Moreover, different from prior work
that learns a binary vector with known dimensionality to select features or kernels on $x$,
we learn an unbounded projection matrix $Z$ using nonparametric Bayesian techniques.

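As in iLSVM, we take the fully Bayesian point of view and treat $\eta_m$ as random and
define the effective discriminant function for task $m$ as the expectation

$$f_m(x; p(Z, \eta)) := \mathbb{E}_{p(Z, \eta)}[f_m(x, Z; \eta_m)] = \mathbb{E}_{p(Z, \eta)}[Z \eta_m]^\top x. \qquad (37)$$

Then, the prediction rule for task $m$ is naturally $y_m^* := \mathrm{sign}\, f_m(x)$. Similarly, we do regularized
Bayesian inference by imposing the following constraints and defining

$$U^{MT}(\xi) := C \sum_{m, \, n \in \mathcal{I}_{\mathrm{tr}}^m} \xi_{mn}$$

and

$$\mathcal{P}_{\mathrm{post}}^{MT}(\xi) := \Big\{ p(Z, \eta) \; \Big| \; \forall m, \forall n \in \mathcal{I}_{\mathrm{tr}}^m: \; y_{mn} \mathbb{E}_{p(Z, \eta)}[Z \eta_m]^\top x_{mn} \ge 1 - \xi_{mn}; \;\; \xi_{mn} \ge 0 \Big\}. \qquad (38)$$

Finally, to obtain more data to estimate the latent $Z$, we also relate it to observed data by
defining the likelihood model

$$p(x_{mn} | w_{mn}, Z, \lambda_{mn}^2) = \mathcal{N}(x_{mn} | Z w_{mn}, \lambda_{mn}^2 I), \qquad (39)$$

where $w_{mn}$ is a vector. We assume $W$ has an independent prior $\pi(W) = \prod_{mn} \mathcal{N}(w_{mn} | 0, \sigma_{m0}^2 I)$.
Fig. 4(b) illustrates the graphical structure of MT-iLSVM.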

For training, we can derive a similar convex conjugate as in the case of iLSVM. As
in iLSVM, minimizing $U^{MT}(\xi)$ is equivalent to minimizing the hinge-loss $\mathcal{R}_h^{MT}$ of the
multiple binary prediction rules, where

$$\mathcal{R}_h^{MT}(p(Z, \eta)) = C \sum_{m, \, n \in \mathcal{I}_{\mathrm{tr}}^m} \max\big(0, 1 - y_{mn} \mathbb{E}_{p(Z, \eta)}[Z \eta_m]^\top x_{mn}\big). \qquad (40)$$

Thus, the RegBayes problem of MT-iLSVM can be equivalently written as

$$\min_{p(Z, \eta, W)} \; \mathrm{KL}(p(Z, \eta, W) \| p(Z, \eta, W, \mathcal{D})) + \mathcal{R}_h^{MT}(p(Z, \eta)). \qquad (41)$$

Then, by Fenchel's duality theorem and Lemma 6, we can derive the conjugate of
MT-iLSVM. The proof is deferred to Appendix A.6.

Lemma 11 (Conjugate of MT-iLSVM) For the MT-iLSVM problem, we have that

$$\min_{p(Z, \eta, W) \in \mathcal{P}_{\mathrm{prob}}} \; \mathrm{KL}(p(Z, \eta, W) \| p(Z, \eta, W, \mathcal{D})) + \mathcal{R}_h^{MT}(p(Z, \eta)) \qquad (42)$$
$$= \max_{\omega} \; -\log Z'(\omega) + \sum_{m, n} \omega_{mn} - \sum_{m, n} g_0^*(\omega_{mn}).$$

Moreover, the optimum distribution is the posterior distribution

$$p(Z, \eta, W) = \frac{1}{Z'(\omega)} \, p(Z, \eta, W, \mathcal{D}) \exp\Big\{ \sum_{m, n} y_{mn} \omega_{mn} (Z \eta_m)^\top x_{mn} \Big\}, \qquad (43)$$

where $Z'(\omega)$ is the normalization factor and $\omega$ is the solution of the dual problem.

For testing, we use the same strategy as in iLSVM to do Bayesian inference on both
training and test data. The difference is that training data are subject to large-margin
constraints, while test data are not. Similarly, the hyper-parameters $\sigma_{m0}^2$ and $\lambda_{mn}^2$ can be
set a priori or estimated from data (See Appendix A.7 for details).



4.4 Inference with Truncated Mean-Field Constraints 

We discuss how to do regularized Bayesian inference (3) with the large-margin constraints
for both iLSVM and MT-iLSVM. From the primal-dual formulations, there are basically
two methods to perform the regularized Bayesian inference. One is to
directly solve for the posterior distribution $p(Z, \eta, W)$, and the other is to first solve the dual
problem for the optimum $\omega$ and then infer the posterior distribution. However, both the
primal and dual problems are intractable to solve for iLSVM and MT-iLSVM. The intrinsic
hardness is due to the mutual dependency among the latent variables in the desired posterior
distribution. Therefore, a natural approximation method is the mean field (Jordan et al.,
1999), which breaks the mutual dependency by assuming $p$ is of some factorization form.
This method approximates the original problems by imposing additional constraints. An
alternative method is to apply approximate methods (e.g., MCMC sampling) to infer the
true posterior distributions derived via convex conjugates as above, and iteratively estimate
the dual parameters using approximate statistics (e.g., feature expectations estimated using
samples) (Schofield, 2006). Below, we use MT-iLSVM as an example to illustrate the idea
of the first strategy. A full discussion on the second strategy is beyond the scope of this
paper. For iLSVM, a similar procedure applies and we defer its details to Appendix A.8.

Algorithm 1 Inference Algorithm for Infinite Latent SVMs
  Input: corpus $\mathcal{D}$ and constants $(\alpha, C)$.
  Output: posterior distribution $p(\nu, Z, \eta, W)$.
  repeat
    infer $p(\nu)$, $p(W)$ and $p(Z)$ with $p(\eta)$ and $\omega$ given;
    infer $p(\eta)$ and solve for $\omega$ with $p(Z)$ given.
  until convergence

To make the problem easier to solve, we use the stick-breaking representation of IBP,
which includes the auxiliary variable $\nu$, and infer the expanded posterior $p(\nu, W, Z, \eta)$. The
joint model distribution is now $p(\nu, W, Z, \eta, \mathcal{D})$. Furthermore, we impose the truncated
mean-field constraint that

$$p(\nu, W, Z, \eta) = p(\eta) \prod_{k=1}^{K} \Big( p(\nu_k | \gamma_k) \prod_{d=1}^{D} p(z_{dk} | \psi_{dk}) \Big) \prod_{mn} p(w_{mn} | \phi_{mn}, \sigma_{mn}^2 I), \qquad (44)$$

where $K$ is the truncation level, and we assume that

$$p(\nu_k | \gamma_k) = \mathrm{Beta}(\gamma_{k1}, \gamma_{k2}),$$
$$p(z_{dk} | \psi_{dk}) = \mathrm{Bernoulli}(\psi_{dk}),$$
$$p(w_{mn} | \phi_{mn}, \sigma_{mn}^2 I) = \mathcal{N}(w_{mn} | \phi_{mn}, \sigma_{mn}^2 I).$$

Then, we can use the duality theory (footnote 9) to solve the RegBayes problem by alternating between
two substeps, as outlined in Algorithm 1 and detailed below.

9. Lagrangian duality (Ito and Kunisch, 2008) was used in (Zhu et al., 2011a) to solve the constrained
variational formulations, which is closely related to Fenchel duality (Magnanti, 1974) and leads to the
same solutions for iLSVM and MT-iLSVM.

Infer $p(\nu)$, $p(W)$ and $p(Z)$: Since $p(\nu)$ and $p(W)$ are not directly involved in the
posterior constraints, we can solve for them by using standard Bayesian inference, i.e.,
minimizing a KL-divergence. Specifically, for $p(W)$, since the prior is also normal, we can
easily derive the update rules for $\phi_{mn}$ and $\sigma_{mn}^2$. For $p(\nu)$, we have the same update rules
as in (Doshi-Velez et al., 2009). We defer the details to Appendix A.7.

For $p(Z)$, it is directly involved in the posterior constraints. So, we need to solve it
together with $p(\eta)$ using conjugate theory. However, this is intractable. Here, we adopt an
alternating strategy that first infers $p(Z)$ with $p(\eta)$ and dual parameters $\omega$ fixed, and then
infers $p(\eta)$ and solves for $\omega$. Specifically, since the large-margin constraints are linear in
$p(Z)$, we can get the mean-field update equation as

$$\psi_{dk} = \frac{1}{1 + e^{-\vartheta_{dk}}},$$

where

$$\vartheta_{dk} = \sum_{j=1}^{k} \mathbb{E}_p[\log \nu_j] - \mathcal{L}_k^{\nu} - \sum_{mn} \frac{1}{2 \lambda_{mn}^2} \Big( (K \sigma_{mn}^2 + (\phi_{mn}^k)^2) - 2 x_{mn}^d \phi_{mn}^k + 2 \sum_{j \neq k} \phi_{mn}^j \phi_{mn}^k \psi_{dj} \Big) + \sum_{m, \, n \in \mathcal{I}_{\mathrm{tr}}^m} y_{mn} \omega_{mn} \mathbb{E}_p[\eta_{mk}] x_{mn}^d, \qquad (45)$$

and $\mathcal{L}_k^{\nu}$ is a lower bound of $\mathbb{E}_p[\log(1 - \prod_{j=1}^{k} \nu_j)]$ (See Appendix A.7 for details). The last
term of $\vartheta_{dk}$ is due to the large-margin posterior constraints as defined in Eq. (38). We can see
how the large-margin constraints regularize the procedure of inferring the latent matrix $Z$.
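
The role of the large-margin term in this update can be made concrete with a small sketch. The code below is not the paper's implementation: it assumes that the prior and likelihood contributions to $\vartheta_{dk}$ have already been collected in a hypothetical array `base`, and it only adds the last term of Eq. (45), $\sum_{m,n} y_{mn} \omega_{mn} \mathbb{E}_p[\eta_{mk}] x_{mn}^d$, before applying the logistic function. The data layout and all names are illustrative assumptions.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def margin_term(d, k, data, omega, eta_mean):
    """sum_{m,n} y_mn * omega_mn * E[eta_mk] * x_mn^d, the last term of Eq. (45).

    data[m] is a list of (x, y) pairs for task m, omega[m][n] the dual parameter
    of example n in task m, and eta_mean[m][k] the posterior mean of eta_mk.
    """
    total = 0.0
    for m, task in enumerate(data):
        for n, (x, y) in enumerate(task):
            total += y * omega[m][n] * eta_mean[m][k] * x[d]
    return total

def update_psi(d, k, base, data, omega, eta_mean):
    """psi_dk = sigmoid(base_dk + margin term); base holds the remaining terms of Eq. (45)."""
    return sigmoid(base[d][k] + margin_term(d, k, data, omega, eta_mean))

data = [[([1.0, -2.0], 1), ([0.5, 0.5], -1)]]  # one toy task with two examples
omega = [[0.3, 0.1]]                           # assumed dual parameters
eta_mean = [[2.0, -1.0]]                       # assumed posterior means of eta
base = [[0.0, 0.0], [0.0, 0.0]]                # assumed prior and likelihood terms
psi_00 = update_psi(0, 0, base, data, omega, eta_mean)
print(margin_term(0, 0, data, omega, eta_mean), psi_00)
```

When all dual parameters are zero, that is, when no training example violates its margin, the margin term vanishes and the update reduces to the usual unsupervised mean-field update.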

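Infer $p(\eta)$ and solve for $\omega$: Now, we can apply the convex conjugate theory and show
that the optimum posterior distribution of $\eta$ is

$$p(\eta) = \prod_m p(\eta_m), \quad \text{where} \quad p(\eta_m) \propto \pi(\eta_m) \exp\{\eta_m^\top \mu_m\},$$

and $\mu_m = \sum_{n \in \mathcal{I}_{\mathrm{tr}}^m} y_{mn} \omega_{mn} (\psi^\top x_{mn})$. Here, we assume $\pi(\eta_m)$ is standard normal. Then, we
have $p(\eta_m) = \mathcal{N}(\eta_m | \mu_m, I)$ and the optimum dual parameters can be obtained by solving
the following $M$ independent dual problems

$$\max_{\omega_m} \; -\frac{1}{2} \mu_m^\top \mu_m + \sum_{n \in \mathcal{I}_{\mathrm{tr}}^m} \omega_{mn} \qquad (46)$$
$$\text{s.t.:} \;\; 0 \le \omega_{mn} \le C, \;\; \forall n \in \mathcal{I}_{\mathrm{tr}}^m,$$

where the constraints are from the conjugate function in Lemma 11. These dual problems
(or their primal forms) can be efficiently solved with a binary SVM solver, such as SVM-light
or LibSVM.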
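
Each per-task dual problem is a box-constrained concave quadratic program in the dual parameters: maximize $-\frac{1}{2} \mu_m^\top \mu_m + \sum_n \omega_{mn}$ subject to $0 \le \omega_{mn} \le C$, with $\mu_m = \sum_n y_{mn} \omega_{mn} (\psi^\top x_{mn})$. The sketch below solves one such problem with plain projected gradient ascent. This is a simple stand-in chosen for illustration, not the SVM-light or LibSVM route mentioned in the text, and it assumes the latent features $\psi^\top x_{mn}$ are already computed and passed in as `feats`. All names are hypothetical.

```python
def solve_task_dual(feats, labels, C, iters=2000):
    """Projected gradient ascent for one task's box-constrained dual problem.

    feats[n] is assumed to hold the latent feature vector psi^T x_n of example n,
    labels[n] is +1 or -1. Returns the dual parameters w and the mean vector mu.
    """
    n, dim = len(feats), len(feats[0])

    def mean_vector(w):
        return [sum(labels[i] * w[i] * feats[i][j] for i in range(n)) for j in range(dim)]

    # The trace of the Gram matrix bounds its largest eigenvalue, so 1 / trace is a safe step.
    step = 1.0 / max(sum(v * v for f in feats for v in f), 1e-12)
    w = [0.0] * n
    for _ in range(iters):
        mu = mean_vector(w)  # held fixed during the sweep, so this is a full gradient step
        for i in range(n):
            grad = 1.0 - labels[i] * sum(feats[i][j] * mu[j] for j in range(dim))
            w[i] = min(C, max(0.0, w[i] + step * grad))  # project back onto [0, C]
    return w, mean_vector(w)

feats = [[1.0], [-1.0]]  # toy one-dimensional latent features
labels = [1, -1]
w, mu = solve_task_dual(feats, labels, C=10.0)
w_small, mu_small = solve_task_dual(feats, labels, C=0.2)
print(w, mu, w_small, mu_small)
```

In this toy problem both examples sit exactly on the margin when $C$ is large, so the posterior mean is 1. With a small $C$ the box constraint becomes active and the mean shrinks accordingly.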

5. Experiments 

We present empirical results for both classification and multi-task learning. Our results 
appear to demonstrate the merits inherited from both Bayesian nonparametrics and large- 
margin learning. 

5.1 Multi-way Classification 

We evaluate the infinite latent SVM (iLSVM) for classification on the real TRECVID2003
and Flickr image datasets, which have been extensively evaluated in the context of learning
finite latent feature models (Chen et al., 2010). TRECVID2003 consists of 1078 video key-frames,
and each example has two types of features - a 1894-dimension binary vector of text
features and a 165-dimension HSV color histogram. The Flickr image dataset consists of 3411
natural scene images about 13 types of animals, including squirrel, cow, cat, zebra, tiger,
lion, elephant, whales, rabbit, snake, antlers, hawk and wolf, downloaded from the Flickr
website (footnote 10). Also, each example has two types of features, including 500-dimension SIFT
bag-of-words and 634-dimension real-valued features (e.g., color histogram, edge direction
histogram, and block-wise color moments). Here, we consider the real-valued features only
by using Gaussian likelihood distributions for $x$.

10. http://www.flickr.com/

Table 1: Classification accuracy and F1 scores on the TRECVID2003 and Flickr image
datasets.

                 TRECVID2003                        Flickr
Model      Accuracy         F1 score          Accuracy         F1 score
EFH+SVM    0.565 ± 0.0      0.427 ± 0.0       0.476 ± 0.0      0.461 ± 0.0
MMH        0.566 ± 0.0      0.430 ± 0.0       0.538 ± 0.0      0.512 ± 0.0
IBP+SVM    0.553 ± 0.013    0.397 ± 0.030     0.500 ± 0.004    0.477 ± 0.009
iLSVM      0.563 ± 0.010    0.448 ± 0.011     0.533 ± 0.005    0.510 ± 0.010

Figure 6: Accuracy and F1 score of MMH on the Flickr dataset with different numbers of
latent features.



We compare iLSVM with the large-margin Harmonium (MMH) (Chen et al., 2010), which was
shown to outperform many other latent feature models, and with two decoupled approaches,
EFH+SVM and IBP+SVM. EFH+SVM uses the exponential family Harmonium (EFH) (Welling et al.,
2004) to discover latent features and then learns a multi-way SVM classifier. IBP+SVM is
similar, but uses an IBP factor analysis model (Griffiths and Ghahramani, 2005) to discover
latent features. Both MMH and EFH+SVM are finite models, so the dimensionality of their
latent features must be pre-specified. We report their classification accuracy and F1 score
(i.e., the average F1 score over all possible classes) (Zhu et al., 2011b) achieved with
the best dimensionality in Table 1. Figure 6 illustrates how the performance of MMH changes
with the number of latent features, from which we can see that K = 40 produces the best
performance and that either increasing or decreasing K could make the performance worse.
For iLSVM and IBP+SVM, we use the mean-field inference method and present the average
performance over 5 randomly initialized runs (see Appendix A.8 for the algorithm and
initialization details). We perform 5-fold cross-validation on the training data to select
hyperparameters, e.g., α and C (we use the same procedure for MT-iLSVM). We can see that
iLSVM can achieve comparable performance with
Figure 7: (Top) the overall average values of the latent features with standard deviation
over different classes; and (Bottom) the per-class average values of latent features
learned by iLSVM on the TRECVID dataset.

Figure 8: The overall average values of the latent features with standard deviation over
different classes on the Flickr dataset.



the nearly optimal MMH, without needing to pre-specify the latent feature dimension^11,
and is much better than the decoupled approaches (i.e., IBP+SVM and EFH+SVM).
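The two reported measures can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `accuracy` and `macro_f1` are hypothetical, and it assumes that "all possible classes" means every class that occurs in either the true or the predicted labels, which the text does not spell out.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores (assumed class set: union of labels)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, `macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])` averages the per-class scores 2/3, 4/5 and 1.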

It is also interesting to examine the discovered latent features. Figure 7 shows the
overall average values of latent features and the per-class average feature values of iLSVM
in one run on the TRECVID dataset. We can see that on average only about 45 features
are active for the TRECVID dataset. For the overall average, we also present the standard
deviation over the 5 categories. A larger deviation means that the corresponding feature
is more discriminative when predicting different categories. For example, feature 26 and
feature 34 are generally less discriminative than many other features, such as feature 1
and feature 30. Figure 8 shows the overall average feature values together with the standard
deviation on the Flickr dataset. We omitted the per-class average because that figure is too

11. We set the truncation level to 300, which is large enough. 
Figure 9: Six example features discovered by iLSVM on the Flickr animal dataset. For each
feature, we show 5 top-ranked images.

crowded with 13 categories. We can see that as k increases, the probability that feature k
is active decreases. The features with stable values (i.e., extremely small standard
deviations) are due to our initialization strategy (each feature has 0.5 probability
to be active). Initializing $\psi_{\cdot k}$ to be exponentially decreasing (e.g., like the
construction process of $\pi$) leads to a faster decay, and many features will be inactive.
To examine the semantics^12 of each feature, Figure 9 presents some example features
discovered on the Flickr animal dataset. For each feature, we present 5 top-ranked images
which have large values on this particular feature. We can see that most of the features are
semantically interpretable. For instance, feature F1 is about squirrel; feature F2 is about
ocean animal, which is whales in the Flickr dataset; and feature F4 is about hawk. We can
also see that some features are about different aspects of the same category. For example,
feature F2 and feature F3 are both about whales, but with different backgrounds.
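The per-feature statistics discussed above can be computed along the following lines. This is a sketch under an assumption: it reads "overall average with standard deviation over classes" as the mean and (population) standard deviation of the per-class averages; the helper name `class_feature_stats` is hypothetical.

```python
import math


def class_feature_stats(Z, labels):
    """Return (overall_mean, std_over_classes), one value per latent feature.

    Z is a list of rows (one row of feature values per example) and labels
    holds the class of each row. Both statistics are taken over the
    per-class average feature values.
    """
    classes = sorted(set(labels))
    K = len(Z[0])
    per_class = []
    for c in classes:
        rows = [z for z, y in zip(Z, labels) if y == c]
        per_class.append([sum(r[k] for r in rows) / len(rows) for k in range(K)])
    overall = [sum(m[k] for m in per_class) / len(classes) for k in range(K)]
    std = [
        math.sqrt(sum((m[k] - overall[k]) ** 2 for m in per_class) / len(classes))
        for k in range(K)
    ]
    return overall, std
```

A feature whose entry in `std` is large takes very different average values across classes, which is the sense in which the text calls it more discriminative.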

5.2 Multi-task Learning 

Now, we evaluate the multi-task infinite latent SVM (MT-iLSVM) on several well-studied 
real datasets. 



12. The interpretation of latent features depends heavily on the input data. 
Table 2: Multi-label classification performance on Scene and Yeast datasets.

Dataset   Model         Acc               F1-Micro          F1-Macro
Yeast     YaXue         0.5106            0.3897            0.4022
          Piyushrai-1   0.5212            0.3631            0.3901
          Piyushrai-2   0.5424            0.3946            0.4112
          MT-IBP+SVM    0.5475 ± 0.005    0.3910 ± 0.006    0.4345 ± 0.007
          MT-iLSVM      0.5792 ± 0.003    0.4258 ± 0.005    0.4742 ± 0.008
Scene     YaXue         0.7765            0.2669            0.2816
          Piyushrai-1   0.7756            0.3153            0.3242
          Piyushrai-2   0.7911            0.3214            0.3226
          MT-IBP+SVM    0.8590 ± 0.002    0.4880 ± 0.012    0.5147 ± 0.018
          MT-iLSVM      0.8752 ± 0.004    0.5834 ± 0.026    0.6148 ± 0.020




5.2.1 Description of the Data 



Scene and Yeast Data: These datasets are from the UCI repository, and each data
example has multiple labels. As in (Rai and Daumé III, 2010), we treat the multi-label
classification as a multi-task learning problem, where each label assignment is treated as a
binary classification task. The Yeast dataset consists of 1500 training and 917 test examples,
each having 103 features, and the number of labels (or tasks) per example is 14. The Scene
dataset consists of 1211 training and 1196 test examples, each having 294 features, and the
number of labels (or tasks) per example for this dataset is 6.
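The reduction from multi-label classification to multi-task learning can be illustrated with a short sketch. The function name `labels_to_tasks` and the +1/-1 encoding are assumptions made for illustration; the text only states that each label assignment becomes one binary task.

```python
def labels_to_tasks(label_sets, num_labels):
    """Build one binary target vector per label (task).

    label_sets[n] is the set of label indices attached to example n.
    The result has num_labels rows; row m holds +1 where example n carries
    label m and -1 otherwise (an assumed encoding).
    """
    return [
        [1 if m in labels else -1 for labels in label_sets]
        for m in range(num_labels)
    ]
```

Under this view the Yeast data would give 14 binary tasks and the Scene data 6, all sharing the same input features.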

School Data: This dataset comes from the Inner London Education Authority and
has been used to study the effectiveness of schools. It consists of examination records
from 139 secondary schools in years 1985, 1986 and 1987. It is a random 50% sample
with 15362 students. The dataset is publicly available and has been extensively evaluated
in various multi-task learning methods (Bakker and Heskes, 2003; Bonilla et al., 2008;
Zhang and Yeung, 2010), where each task is defined as predicting the exam scores of
students belonging to a specific school based on four student-dependent features (year of
the exam, gender, VR band and ethnic group) and four school-dependent features (percentage
of students eligible for free school meals, percentage of students in VR band 1, school
gender and school denomination). In order to compare with the above methods, we follow the
same setup described in (Argyriou et al., 2007; Bakker and Heskes, 2003) and similarly
create dummy variables for the categorical features, forming a total of 19 student-dependent
features and 8 school-dependent features. We use the same 10 random splits^13
of the data, so that 75% of the examples from each school (task) belong to the training set
and 25% to the test set. On average, the training set includes about 80 students per school
and the test set about 30 students per school.



13. Available at: http://ttic.uchicago.edu/~argyriou/code/index.html 
Figure 10: Percentage of explained variance by various models on the School dataset.



5.2.2 Results 



Scene and Yeast Data: We compare with the closely related nonparametric Bayesian
methods, including kernel stick-breaking (YaXue) (Xue et al., 2007) and the basic and
augmented infinite predictor subspace models (i.e., Piyushrai-1 and Piyushrai-2) (Rai and
Daumé III, 2010). These nonparametric Bayesian models were shown to outperform independent
Bayesian logistic regression and a single-task pooling approach (Rai and Daumé III, 2010).
We also compare with a decoupled method, MT-IBP+SVM^14, that uses an IBP factor analysis
model to find shared latent features among multiple tasks and then builds separate SVM
classifiers for different tasks. For MT-iLSVM and MT-IBP+SVM, we use the mean-field
inference method in Sec. 4.4 and report the average performance over 5 randomly initialized
runs (see Appendix A.7 for initialization details). For comparison with (Rai and Daumé III,
2010; Xue et al., 2007), we use the overall classification accuracy, F1-Macro and F1-Micro
as performance measures. Table 2 shows the results. On both datasets, MT-iLSVM needs
fewer than 50 latent features on average. We can see that the large-margin MT-iLSVM
performs much better than the other nonparametric Bayesian methods and MT-IBP+SVM, which
separates the inference of latent features from learning the classifiers.

School Data: We use the percentage of explained variance (Bakker and Heskes, 2003)
as the measure of the regression performance, which is defined as the total variance of the
data minus the sum-squared error on the test set as a percentage of the total variance.
Since we use the same settings, we can compare with the state-of-the-art results of

(1) Bayesian multi-task learning (BMTL) (Bakker and Heskes, 2003);

(2) Multi-task Gaussian processes (MTGP) (Bonilla et al., 2008);

(3) Convex multi-task relationship learning (MTRL) (Zhang and Yeung, 2010);

and single-task learning (STL) as reported in (Bonilla et al., 2008; Zhang and Yeung, 2010).
For MT-iLSVM and MT-IBP+SVM, we also report the results achieved by using both the



14. This decoupled approach is in fact a one-iteration MT-iLSVM, where we first infer the
shared latent matrix Z and then learn an SVM classifier for each task.

Figure 11: Sensitivity study of MT-iLSVM: (a) classification accuracy with different α on
Yeast data; (b) classification accuracy with different C on Yeast data; (c) percentage
of explained variance with different α on School data; and (d) percentage
of explained variance with different C on School data.



latent features (i.e., $Z^\top x$) and the original input features x through vector
concatenation, and we denote the corresponding methods by MT-iLSVM^f and MT-IBP+SVM^f,
respectively. On average the multi-task latent SVM (i.e., MT-iLSVM) needs about 50 latent
features to get sufficiently good and robust performance. From the results in Figure 10 we
can see that MT-iLSVM achieves better results than the existing methods that have
been tested in previous studies. Again, the joint MT-iLSVM performs much better than
the decoupled method MT-IBP+SVM, which separates the latent feature inference from the
training of large-margin classifiers. Finally, using both latent features and the original
input features boosts the performance slightly for MT-iLSVM, and much more significantly
for the decoupled MT-IBP+SVM.
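The percentage of explained variance used for the School data can be written out as a small function. This is a sketch under one assumption: the "total variance of the data" is taken to be the sum of squared deviations of the test targets from their mean, which the text does not state explicitly. The name `explained_variance_pct` is hypothetical.

```python
def explained_variance_pct(y_true, y_pred):
    """100 * (total variance - sum-squared error) / total variance."""
    mean = sum(y_true) / len(y_true)
    # Assumed definition of total variance: squared deviations from the mean.
    total = sum((y - mean) ** 2 for y in y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 100.0 * (total - sse) / total
```

With this definition a predictor that always outputs the mean score explains 0% of the variance, and a perfect predictor explains 100%.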
Figure 12: Percentage of explained variance and running time by MT-iLSVM with various
training sizes.



5.3 Sensitivity Analysis 

Figure 11 shows how the performance of MT-iLSVM changes with the hyperparameter
α and the regularization constant C on the Yeast and School datasets. We can see that on
the Yeast dataset, MT-iLSVM is insensitive to both α and C. For the School dataset,
MT-iLSVM is very insensitive to α, and it is stable when C is set between 0.3 and 1.

Figure 12 shows how the training size affects the performance and running time of
MT-iLSVM on the School dataset. We use the first b% (b = 50, 60, 70, 80, 90, 100) of the
training data in each of the 10 random splits as the training set and use the corresponding
test data as the test set. We can see that as the training size increases, the performance
and running time generally increase, and MT-iLSVM achieves state-of-the-art performance
when using about 70% of the training data. From the running time, we can also see that
MT-iLSVM is generally quite efficient when using mean-field inference.

Finally, we investigate how the performance of MT-iLSVM changes with the
hyperparameters $\sigma^2_{m0}$ and $\lambda^2_{mn}$. We initially set $\sigma^2_{m0} = 1$
and compute $\lambda^2_{mn}$ from the observed data. If we further estimate them by
maximizing the objective function, the performance does not change much (±0.3% for average
explained variance on the School dataset). We have similar observations for iLSVM.

6. Conclusions and Discussions 

We present regularized Bayesian inference (RegBayes), a computational framework to
perform post-data posterior inference with a convex regularization on the desired posterior
distributions. RegBayes is applicable to both directed and undirected graphical models.
General conjugate results are derived when the posterior regularization is induced from a
linear operator (e.g., expectation). Furthermore, we concentrate on developing two
large-margin nonparametric Bayesian models under the RegBayes framework to
learn predictive latent features for classification and multi-task learning, by exploring the
large-margin principle to define posterior constraints. Both models allow the latent
dimension to be automatically resolved from the data. The empirical results on several real
datasets appear to demonstrate that our methods inherit the merits of both Bayesian
nonparametrics and large-margin learning.

Regularized Bayesian inference offers a computational framework for considering
posterior regularization in performing nonparametric Bayesian inference. For future work, we
plan to study other posterior regularization beyond the large-margin constraints, such as
posterior constraints defined on manifold structures (Huh and Fienberg, 2010), and
investigate how posterior regularization can be used in other interesting nonparametric
Bayesian models (Beal et al., 2002; Teh et al., 2006; Blei and Frazier, 2010) in different
contexts, such as link prediction (Miller et al., 2009) for social network analysis. As we
have stated, RegBayes can be developed for undirected MRFs, but the inference would be even
harder. We plan to do a systematic investigation along this direction. We have some
preliminary results presented in (Chen et al.
Appendix A.1: Proof of Lemma 6

Proof By definition, $g_0^*(\mu) = \sup_{x \in \mathbb{R}} (x\mu - C\max(0, x))$. We consider
two cases. First, if $\mu < 0$, we have

$$g_0^*(\mu) \ge \sup_{x < 0} (x\mu - C\max(0, x)) = \sup_{x < 0} x\mu = \infty.$$

Therefore, we have $g_0^*(\mu) = \infty$ if $\mu < 0$. Second, if $\mu \ge 0$, we have

$$g_0^*(\mu) = \sup_{x \ge 0} (x\mu - Cx) = \mathbb{I}(\mu \le C).$$

Putting the above results together, we prove the claim.
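The two cases of this proof can be sanity-checked numerically. The sketch below approximates the supremum over a finite grid, so an infinite conjugate value shows up as a value that grows with the grid width rather than as infinity; the function name `conjugate_g0` and the grid are illustrative choices, not part of the paper.

```python
def conjugate_g0(mu, C, xs):
    """Approximate sup_x (x * mu - C * max(0, x)) over the finite grid xs."""
    return max(x * mu - C * max(0.0, x) for x in xs)


# Grid on [-100, 100] in steps of 0.1 (an arbitrary illustrative choice).
grid = [i / 10.0 for i in range(-1000, 1001)]

inside = conjugate_g0(1.0, 2.0, grid)     # 0 <= mu <= C: supremum attained at x = 0
above = conjugate_g0(3.0, 2.0, grid)      # mu > C: grows with the right grid edge
negative = conjugate_g0(-1.0, 2.0, grid)  # mu < 0: grows with the left grid edge
```

For 0 ≤ μ ≤ C the approximation stays at zero however wide the grid is, while in the other two cases it equals the value at the grid boundary, matching the finite and infinite branches of the conjugate.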



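Appendix A.2: Proof of Lemma 7

Proof The proof has a similar structure to the proof of Lemma 6. Specifically, by
definition, the conjugate is

$$g_1^*(\mu) = \sup_{x} \{\mu^\top x - g_1(x)\} = \sup_{x} \Big\{\sum_i \mu_i x_i - C\max(x_1, \cdots, x_L)\Big\}.$$

We first show that $\forall i, \mu_i \ge 0$ in order to have finite $g_1^*$ values. Suppose
that $\exists j, \mu_j < 0$. Then, we define

$$\mathcal{G}_j = \{x \in \mathcal{G} : x_j < 0\}, \text{ and } \mathcal{G}_j^\circ = \{x \in \mathcal{G}_j : x_i = 0, \text{ if } i \ne j\}. \tag{47}$$

Since $\mathcal{G}_j^\circ \subset \mathcal{G}_j \subset \mathcal{G}$, we have

$$g_1^*(\mu) \ge \sup_{x \in \mathcal{G}_j} \{\mu^\top x - g_1(x)\} \ge \sup_{x \in \mathcal{G}_j^\circ} \{\mu^\top x - g_1(x)\} = \sup_{x_j < 0} \{x_j \mu_j - 0\} = \infty.$$

Therefore, $g_1^*(\mu) = \infty$ if $\exists j, \mu_j < 0$.

Now, we consider the second case, where $\forall i, \mu_i \ge 0$. We can easily show that

$$\forall x \in \mathcal{G}, \quad \mu^\top x - g_1(x) \le \sum_i \mu_i \max(x) - g_1(x).$$

Therefore

$$g_1^*(\mu) \le \sup_{x} \Big\{\Big(\sum_i \mu_i - C\Big)\max(x)\Big\}.$$

Moreover, for any $x_0$ that makes $\psi(\mu) := \sup_{x} \{(\sum_i \mu_i - C)\max(x)\}$ achieve
its supremum, there exists $x'$ (e.g., $x'_1 = \max(x_0)$ and $x'_j = 0, \forall j \ne 1$),
which gives

$$\mu^\top x' - g_1(x') = \psi(\mu).$$

Therefore, we have $g_1^*(\mu) \ge \psi(\mu)$, and

$$g_1^*(\mu) = \sup_{x} \Big\{\Big(\sum_i \mu_i - C\Big)\max(x)\Big\} = \mathbb{I}\Big(\sum_i \mu_i \le C\Big).$$

Putting the above results together proves the claim. ■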



Appendix A.3: Proof of Lemma 8

Proof By definition, the conjugate is

$$\begin{aligned}
g_2^*(\mu) &= \sup_{x} \{\mu_1 x_1 + \mu_2 x_2 - C\max(0, x_1 - c_1) - C\max(0, x_2 - c_2)\} \\
&= \sup_{x_1} \{(\mu_1 - \mu_2) x_1 - C\max(0, x_1 - c_1) - C\max(0, -x_1 - c_2)\} \\
&= \sup_{x_2} \{(\mu_2 - \mu_1) x_2 - C\max(0, -x_2 - c_1) - C\max(0, x_2 - c_2)\}.
\end{aligned}$$

We can see that only the difference $\mu_1 - \mu_2$ or $\mu_2 - \mu_1$ is directly involved in
$g_2^*$. Without loss of generality, we can fix one at 0 and assume the other one is
non-negative. Thus, we have $\mu_1 \mu_2 = 0$. Then, we consider two cases.
If $\mu_1 = 0$ (and $\mu_2 \ge 0$), we have

$$\begin{aligned}
g_2^*(\mu) &= \sup_{x_2} \{\mu_2 x_2 - C\max(0, -x_2 - c_1) - C\max(0, x_2 - c_2)\} \\
&= c_2 \mu_2 + \sup_{z} \{\mu_2 z - C\max(0, -z - c_2 - c_1) - C\max(0, z)\}.
\end{aligned}$$

Using the results in the proof of Lemma 6, we can get

$$g_2^*(\mu) = c_2 \mu_2 + \mathbb{I}(0 \le \mu_2 \le C).$$

Similarly, by symmetry, if $\mu_2 = 0$ (and $\mu_1 \ge 0$), we have

$$g_2^*(\mu) = c_1 \mu_1 + \mathbb{I}(0 \le \mu_1 \le C).$$

Putting the above results together, we get the conclusions in the lemma.



Appendix A.4: Proof of Lemma 9

Proof The proof has a similar structure to the proof of Lemma 10. In this case, the linear
expectation operator $E$ acts on $\mathcal{P}_{\mathrm{prob}}$, and the element of $Ep$
evaluated at the $n$th example is a two-dimensional vector $\mu_n = (\mu_1, \mu_2)^\top$,
where $\mu_1 + \mu_2 = 0$ and

$$\mu_2 = \mathbb{E}_{q(\{\theta_n, z_{nm}, \eta\} | \Theta)}[\eta^\top \bar{z}_n]. \tag{48}$$

Then, by the fact that $\max(0, |x| - c) = \max(0, x - c) + \max(0, -x - c)$ and using the
$g_2$ function defined in Lemma 8, we have

$$g(Ep) := \mathcal{U}(q(\{\theta_n, z_{nm}, \eta\} | \Theta)) = \sum_n g_2(\mu_n; -y_n + \epsilon, y_n + \epsilon).$$

Therefore

$$g^*(\omega) = \sum_n g_2^*(\omega_n; -y_n + \epsilon, y_n + \epsilon)$$

and

$$g^*(-\omega) = \sum_n g_2^*(-\omega_n; -y_n + \epsilon, y_n + \epsilon).$$

By the results in Lemma 5 and Lemma 8, we can derive the conjugate and the optimum
solution of $q$. The optimum solution of $\Theta$ is due to Lemma 1. Note that the
constraints are not directly dependent on $\Theta$. ■



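Appendix A.5: Proof of Lemma 10

Proof By definition, we have $g(Ep) := \mathcal{U}^c(p(Z, \eta)) = \sum_n g_1(\ell_n - Ep(n))$.
Let $\mu_n = Ep(n)$. We have the conjugate

$$\begin{aligned}
g^*(\omega) &= \sup_{\mu} \Big\{\omega^\top \mu - \sum_n g_1(\ell_n - \mu_n)\Big\} \\
&= \sum_n \sup_{\mu_n} \{\omega_n^\top \mu_n - g_1(\ell_n - \mu_n)\} \\
&= \sum_n \sup_{\nu_n} \{\omega_n^\top (\ell_n - \nu_n) - g_1(\nu_n)\} \\
&= \sum_n \big(\omega_n^\top \ell_n + g_1^*(-\omega_n)\big),
\end{aligned}$$

where the third line substitutes $\nu_n = \ell_n - \mu_n$. Thus,

$$g^*(-\omega) = \sum_n \big(-\omega_n^\top \ell_n + g_1^*(\omega_n)\big).$$

Using the results of Lemma 5 proves the claim.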



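Appendix A.6: Proof of Lemma 11

Proof The proof has a similar structure to the proof of Lemma 10. In this case, the linear
expectation operator is $E : \mathcal{P}_{\mathrm{prob}} \to \mathbb{R}^{\sum_m |\mathcal{I}^m_{\mathrm{tr}}|}$
and the element of $Ep$ evaluated at the $n$th example for task $m$ is

$$Ep(n, m) := y_{mn} \mathbb{E}_{p(Z, \eta)}[Z \eta_m]^\top x_{mn} = \mathbb{E}_{p(Z, \eta)}[y_{mn} (Z \eta_m)^\top x_{mn}]. \tag{49}$$

Then, let $g_0 : \mathbb{R} \to \mathbb{R}$ be the function defined in Lemma 6. We have

$$g(Ep) := \mathcal{U}^{MT}(p(Z, \eta)) = \sum_{m,n} g_0(1 - Ep(n, m)).$$

Let $\mu = Ep$. By definition, the conjugate is

$$\begin{aligned}
g^*(\omega) &= \sup_{\mu} \Big\{\omega^\top \mu - \sum_{m,n} g_0(1 - \mu_{mn})\Big\} \\
&= \sum_{m,n} \sup_{\mu_{mn}} \{\omega_{mn} \mu_{mn} - g_0(1 - \mu_{mn})\} \\
&= \sum_{m,n} \sup_{\nu_{mn}} \{\omega_{mn} (1 - \nu_{mn}) - g_0(\nu_{mn})\}.
\end{aligned}$$

Thus,

$$g^*(-\omega) = \sum_{m,n} \big(-\omega_{mn} + g_0^*(\omega_{mn})\big).$$

By the results in Lemma 5 and Lemma 6, we can derive the conjugate of the problem (41).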



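Appendix A.7: Inference for MT-iLSVM

In this section, we provide the derivation of the inference algorithm for MT-iLSVM, which
is outlined in Algorithm 2 and detailed below.

For MT-iLSVM, the model $\mathcal{M}$ consists of all the latent variables $(\nu, W, Z, \eta)$.
Let $\mathcal{L}_{mn}(p) := \mathbb{E}_p[\log p(x_{mn} | Z, w_{mn}, \lambda^2_{mn})]$ be the
expected data likelihood. Then, under the truncated mean-field assumption (44), we have

$$\mathcal{L}_{mn}(p) = -\frac{x_{mn}^\top x_{mn} - 2 x_{mn}^\top \mathbb{E}_p[Z w_{mn}] + \mathbb{E}_p[w_{mn}^\top U w_{mn}]}{2\lambda^2_{mn}} - \frac{D \log(2\pi\lambda^2_{mn})}{2},$$

where $x_{mn}^\top \mathbb{E}_p[Z w_{mn}] = \sum_k \phi^k_{mn} (x_{mn}^\top \psi_{\cdot k})$;
$\psi_{\cdot k} = (\psi_{1k} \cdots \psi_{Dk})^\top$ is the $k$th column of $\psi = \mathbb{E}[Z]$;

$$\mathbb{E}_p[w_{mn}^\top U w_{mn}] = 2 \sum_{j<k} \phi^j_{mn} \phi^k_{mn} U_{jk} + \sum_k U_{kk} \big(\sigma^2_{mn} + \phi^k_{mn} \phi^k_{mn}\big);$$

and $U = \mathbb{E}[Z^\top Z]$ is a $K \times K$ matrix, whose element is

$$U_{jk} = \begin{cases} \sum_d \psi_{dk}, & \text{if } j = k, \\ \sum_d \psi_{dj} \psi_{dk}, & \text{otherwise.} \end{cases}$$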



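For the KL-divergence term, we have
$\mathrm{KL}(p(\mathcal{M}) \| \pi(\mathcal{M})) = \mathrm{KL}(p(\nu) \| \pi(\nu)) + \mathrm{KL}(p(W) \| \pi(W)) + \mathrm{KL}(p(Z) \| \pi(Z)) + \mathrm{KL}(p(\eta) \| \pi(\eta))$,
where the individual terms are

$$\begin{aligned}
\mathrm{KL}(p(\nu) \| \pi(\nu)) &= \sum_{k=1}^{K} \Big( (\gamma_{k1} - \alpha)\big(\psi(\gamma_{k1}) - \psi(\gamma_{k1} + \gamma_{k2})\big) + (\gamma_{k2} - 1)\big(\psi(\gamma_{k2}) - \psi(\gamma_{k1} + \gamma_{k2})\big) \\
&\qquad - \log \frac{\Gamma(\gamma_{k1}) \Gamma(\gamma_{k2})}{\Gamma(\gamma_{k1} + \gamma_{k2})} \Big) - K \log \alpha, \\
\mathrm{KL}(p(Z) \| \pi(Z)) &= \sum_{d,k} \Big( -\psi_{dk} \sum_{j=1}^{k} \mathbb{E}_p[\log \nu_j] - (1 - \psi_{dk}) \mathbb{E}_p\Big[\log\Big(1 - \prod_{j=1}^{k} \nu_j\Big)\Big] \\
&\qquad + \psi_{dk} \log \psi_{dk} + (1 - \psi_{dk}) \log(1 - \psi_{dk}) \Big), \\
\mathrm{KL}(p(W) \| \pi(W)) &= \sum_{m,n} \Big( \frac{K \sigma^2_{mn} + \Phi_{mn}^\top \Phi_{mn}}{2 \sigma^2_{m0}} - \frac{K \big(1 + \log \frac{\sigma^2_{mn}}{\sigma^2_{m0}}\big)}{2} \Big),
\end{aligned}$$

where $\psi(\cdot)$ is the digamma function and
$\mathbb{E}_p[\log \nu_j] = \psi(\gamma_{j1}) - \psi(\gamma_{j1} + \gamma_{j2})$. For
$\mathrm{KL}(p(\eta) \| \pi(\eta))$, we do not need to write it explicitly, as we shall see.
Finally, the effective discriminant function is

$$f_m(x_{mn}; p(Z, \eta)) := \mathbb{E}_p[(Z \eta_m)^\top x_{mn}] = \sum_{k=1}^{K} \mathbb{E}_p[\eta_{mk}] \psi_{\cdot k}^\top x_{mn}.$$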

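All the above terms can be easily computed, except the term
$\mathbb{E}_p[\log(1 - \prod_{j=1}^{k} \nu_j)]$. Here, we adopt the multivariate lower bound
(Doshi-Velez et al., 2009)

$$\begin{aligned}
\mathbb{E}_p\Big[\log\Big(1 - \prod_{j=1}^{k} \nu_j\Big)\Big] \ge{} & \sum_{m=1}^{k} q_{km} \psi(\gamma_{m2}) + \sum_{m=1}^{k-1} \Big(\sum_{n=m+1}^{k} q_{kn}\Big) \psi(\gamma_{m1}) \\
& - \sum_{m=1}^{k} \Big(\sum_{n=m}^{k} q_{kn}\Big) \psi(\gamma_{m1} + \gamma_{m2}) + \mathcal{H}(q_{k\cdot}),
\end{aligned}$$

where the variational parameters $q_{k\cdot} = (q_{k1} \cdots q_{kk})^\top$ belong to the
$k$-simplex, and $\mathcal{H}(q_{k\cdot})$ is the entropy of $q_{k\cdot}$. The tightest lower
bound is achieved by setting $q_{k\cdot}$ to be the optimum value

$$q_{km} \propto \exp\Big( \psi(\gamma_{m2}) + \sum_{n=1}^{m-1} \psi(\gamma_{n1}) - \sum_{n=1}^{m} \psi(\gamma_{n1} + \gamma_{n2}) \Big), \tag{50}$$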
Algorithm 2 Inference Algorithm of MT-iLSVM

 1: Input: data $\mathcal{D} = \{(x_{mn}, y_{mn})\}_{m, n \in \mathcal{I}^m_{\mathrm{tr}}} \cup \{x_{mn}\}_{m, n \in \mathcal{I}^m_{\mathrm{tst}}}$, constants α and C
 2: Output: distributions p(ν), p(Z), p(W), p(η) and hyper-parameters $\sigma^2_{m0}$ and $\lambda^2_{mn}$
 3: Initialize $\gamma_{k1} = \alpha$, $\gamma_{k2} = 1$, $\psi_{dk} = 0.5 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.001)$, $\Phi_{mn} = 0$,
    $\sigma^2_{mn} = \sigma^2_{m0} = 1$, $\mu_m = 0$; $\lambda^2_{mn}$ is computed from $\mathcal{D}$.
 4: repeat
 5:   repeat
 6:     update $(\gamma_{k1}, \gamma_{k2})$ using Eq. (52), ∀1 ≤ k ≤ K;
 7:     update $\phi^k_{mn}$ and $\sigma^2_{mn}$ using Eq. (51), ∀m, ∀n, ∀1 ≤ k ≤ K;
 8:     update $\psi_{dk}$ using Eq. (53), ∀1 ≤ d ≤ D, ∀1 ≤ k ≤ K;
 9:   until the relative change of L is less than a threshold τ or the iteration number is T (e.g., 10)
10:   for m = 1 to M do
11:     solve the dual problem (54) using a binary SVM learner.
12:   end for
13:   update the hyper-parameters $\sigma^2_{m0}$ using Eq. (55) and $\lambda^2_{mn}$ using Eq. (56). (Optional)
14: until the relative change of L is less than a threshold τ' or the iteration number is T' (e.g., 20)



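where the normalization factor is chosen to make $q_{k\cdot}$ a distribution. We denote the
tightest lower bound by $\mathcal{L}^{\nu}_k$. Replacing the term
$\mathbb{E}_p[\log(1 - \prod_{j=1}^{k} \nu_j)]$ with its lower bound $\mathcal{L}^{\nu}_k$, we
obtain an upper bound of $\mathrm{KL}(p(\mathcal{M}) \| \pi(\mathcal{M}))$, and we denote this
upper bound by $\mathcal{L}(p)$.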
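The multinomial lower bound and its optimal $q_{k\cdot}$ can be sketched in pure Python as follows. This is an illustrative sketch, not the authors' implementation: the helper names (`digamma`, `optimal_q`, `lower_bound`) are hypothetical, the digamma function is approximated with the standard recurrence and asymptotic series because the standard library does not provide it, and `gamma[m - 1]` is assumed to hold the pair $(\gamma_{m1}, \gamma_{m2})$.

```python
import math


def digamma(x):
    """Approximate digamma(x) for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def optimal_q(gamma, k):
    """Normalized q_{k.} whose m-th entry is proportional to the exponential in Eq. (50)."""
    scores = []
    for m in range(1, k + 1):
        s = digamma(gamma[m - 1][1])
        s += sum(digamma(gamma[n - 1][0]) for n in range(1, m))
        s -= sum(digamma(gamma[n - 1][0] + gamma[n - 1][1]) for n in range(1, m + 1))
        scores.append(s)
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def lower_bound(gamma, q):
    """Value of the lower bound on E[log(1 - prod_{j<=k} nu_j)] for a given q (k = len(q))."""
    k = len(q)
    value = 0.0
    for m in range(1, k + 1):
        g1, g2 = gamma[m - 1]
        value += q[m - 1] * digamma(g2)
        value += sum(q[m:]) * digamma(g1)            # weights q_kn for n = m+1..k
        value -= sum(q[m - 1:]) * digamma(g1 + g2)   # weights q_kn for n = m..k
    value -= sum(x * math.log(x) for x in q if x > 0)  # entropy of q
    return value
```

For k = 1 the bound is exact, since it reduces to $\psi(\gamma_{12}) - \psi(\gamma_{11} + \gamma_{12})$, the expectation of $\log(1 - \nu_1)$ under a Beta distribution; for larger k the value at `optimal_q` is never smaller than the value at any other point of the simplex.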

With the above terms and the upper bound $\mathcal{L}(p)$, we can implement the general
procedure outlined in Algorithm 1 to solve the MT-iLSVM problem. Specifically, the inference
procedure iteratively solves the following steps, as summarized in Algorithm 2:

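Infer p(ν), p(Z) and p(W): For p(W), since both the prior π(W) and p(W) are
Gaussian, we can easily derive the update rules, similar to those in Gaussian mixture models:

$$\begin{aligned}
\phi^k_{mn} &= \Big( \frac{x_{mn}^\top \psi_{\cdot k}}{\lambda^2_{mn}} - \sum_{j \ne k} \frac{\phi^j_{mn} U_{jk}}{\lambda^2_{mn}} \Big) \Big( \frac{1}{\sigma^2_{m0}} + \frac{U_{kk}}{\lambda^2_{mn}} \Big)^{-1}, \\
\sigma^2_{mn} &= \Big( \frac{1}{\sigma^2_{m0}} + \frac{\sum_k U_{kk}}{K \lambda^2_{mn}} \Big)^{-1}.
\end{aligned} \tag{51}$$

For p(ν), we have the update rules similar to those in (Doshi-Velez et al., 2009), that is,

$$\begin{aligned}
\gamma_{k1} &= \alpha + \sum_{m=k}^{K} \sum_{d=1}^{D} \psi_{dm} + \sum_{m=k+1}^{K} \Big(D - \sum_{d=1}^{D} \psi_{dm}\Big) \Big(\sum_{i=k+1}^{m} q_{mi}\Big), \\
\gamma_{k2} &= 1 + \sum_{m=k}^{K} \Big(D - \sum_{d=1}^{D} \psi_{dm}\Big) q_{mk}.
\end{aligned} \tag{52}$$

For p(Z), we have the mean-field update equation as

$$\psi_{dk} = \frac{1}{1 + e^{-\vartheta_{dk}}}, \tag{53}$$

where

$$\begin{aligned}
\vartheta_{dk} ={} & \sum_{j=1}^{k} \mathbb{E}_p[\log \nu_j] - \mathcal{L}^{\nu}_k - \sum_{m,n} \frac{1}{2\lambda^2_{mn}} \Big( \big(\sigma^2_{mn} + (\phi^k_{mn})^2\big) - 2 x^d_{mn} \phi^k_{mn} + 2 \sum_{j \ne k} \phi^j_{mn} \phi^k_{mn} \psi_{dj} \Big) \\
& + \sum_{m, n \in \mathcal{I}^m_{\mathrm{tr}}} y_{mn} \omega_{mn} \mathbb{E}_p[\eta_{mk}] x^d_{mn}.
\end{aligned}$$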
jj^k m,n£X^ 

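Infer p(η) and solve for ω: By the convex duality theory, we have the solution

$$\begin{aligned}
p(\eta) &\propto \pi(\eta) \exp\Big\{ \sum_{m, n \in \mathcal{I}^m_{\mathrm{tr}}} y_{mn} \omega_{mn} \eta_m^\top \psi^\top x_{mn} \Big\} \\
&= \prod_{m=1}^{M} \pi(\eta_m) \exp\Big\{ \eta_m^\top \Big( \sum_{n \in \mathcal{I}^m_{\mathrm{tr}}} y_{mn} \omega_{mn} \psi^\top x_{mn} \Big) \Big\}.
\end{aligned}$$

Therefore, we can see that although we did not assume p(η) is factorized, we can get the
induced factorization form $p(\eta) = \prod_m p(\eta_m)$, where

$$p(\eta_m) \propto \pi(\eta_m) \exp\Big\{ \eta_m^\top \Big( \sum_{n \in \mathcal{I}^m_{\mathrm{tr}}} y_{mn} \omega_{mn} \psi^\top x_{mn} \Big) \Big\}.$$

Here, we assume $\pi(\eta_m)$ is standard normal. Then, we have
$p(\eta_m) = \mathcal{N}(\eta_m | \mu_m, I)$, where

$$\mu_m = \sum_{n \in \mathcal{I}^m_{\mathrm{tr}}} y_{mn} \omega_{mn} \psi^\top x_{mn}.$$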
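The posterior mean of $\eta_m$ is a weighted sum of the training inputs projected onto the expected latent features. A minimal sketch in plain Python, assuming `psi` is the D×K matrix $\mathbb{E}[Z]$ stored as a list of rows and that `xs`, `ys`, `omegas` hold the training inputs, ±1 labels and dual variables of one task (all names are hypothetical):

```python
def posterior_mean(psi, xs, ys, omegas):
    """Compute mu_m = sum_n y_mn * omega_mn * psi^T x_mn for a single task."""
    K = len(psi[0])
    mu = [0.0] * K
    for x, y, w in zip(xs, ys, omegas):
        for k in range(K):
            # k-th entry of psi^T x: inner product of column k of psi with x
            projection = sum(psi[d][k] * x[d] for d in range(len(x)))
            mu[k] += y * w * projection
    return mu
```

Examples whose dual variable is zero do not contribute to the mean, so only the support vectors of each task shape $p(\eta_m)$.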

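The optimum dual parameters can be obtained by solving the following M independent
dual problems

$$\max_{\omega_m} \; -\frac{1}{2} \mu_m^\top \mu_m + \sum_{n \in \mathcal{I}^m_{\mathrm{tr}}} \omega_{mn} \quad \text{s.t.:} \; 0 \le \omega_{mn} \le C, \; \forall n \in \mathcal{I}^m_{\mathrm{tr}}, \tag{54}$$

which (and its primal form) can be efficiently solved with a binary SVM solver, such as
SVM-light.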

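As we have stated, the hyperparameters $\sigma^2_{m0}$ and $\lambda^2_{mn}$ can be set a
priori or estimated from the data. The empirical estimation can be easily done with closed
form solutions. For MT-iLSVM, we have

$$\sigma^2_{m0} = \frac{\sum_{n=1}^{N_m} \big(K \sigma^2_{mn} + \Phi_{mn}^\top \Phi_{mn}\big)}{K N_m}, \tag{55}$$

$$\lambda^2_{mn} = \frac{x_{mn}^\top x_{mn} - 2 x_{mn}^\top \mathbb{E}_p[Z w_{mn}] + \mathbb{E}_p[w_{mn}^\top U w_{mn}]}{D}. \tag{56}$$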

Appendix A.8: Inference for Infinite Latent SVM

In this section, we develop the inference algorithm for iLSVM based on the stick-breaking
construction of the IBP prior. The algorithm is outlined in Algorithm 3.

Similar to the inference for MT-iLSVM, we make the additional constraint about the
feasible distribution

$$p(\nu, W, Z, \eta) = p(\eta) \, p(W | \Phi, \Sigma) \prod_n \Big( \prod_{k=1}^{K} p(z_{nk} | \psi_{nk}) \Big) \prod_{k=1}^{K} p(\nu_k | \gamma_k),$$

where K is the truncation level; $p(W | \Phi, \Sigma) = \prod_k \mathcal{N}(W_{\cdot k} | \Phi_{\cdot k}, \sigma_k^2 I)$;
$p(z_{nk} | \psi_{nk}) = \mathrm{Bernoulli}(\psi_{nk})$; and
$p(\nu_k | \gamma_k) = \mathrm{Beta}(\gamma_{k1}, \gamma_{k2})$. Then, we solve the
unconstrained problem using convex duality with dual parameters being ω. Let
$\mathcal{L}_n(p) := \mathbb{E}_p[\log p(x_n | z_n, W)]$. We have

$$\mathcal{L}_n(p) = -\frac{x_n^\top x_n - 2 x_n^\top \Phi \mathbb{E}_p[z_n]^\top + \mathbb{E}_p[z_n A z_n^\top]}{2 \sigma^2_{n0}} - \frac{D \log(2\pi\sigma^2_{n0})}{2},$$

where $A := \mathbb{E}_p[W^\top W]$ is a $K \times K$ matrix;
$x_n^\top \Phi \mathbb{E}_p[z_n]^\top = \sum_k \psi_{nk} (x_n^\top \Phi_{\cdot k})$; and

$$\mathbb{E}_p[z_n A z_n^\top] = 2 \sum_{j<k} \psi_{nj} \psi_{nk} \Phi_{\cdot j}^\top \Phi_{\cdot k} + \sum_k \psi_{nk} \big(D \sigma_k^2 + \Phi_{\cdot k}^\top \Phi_{\cdot k}\big).$$

The effective discriminant function is $f(y, x_n) = \sum_k \mathbb{E}_p[\eta_y^k] \psi_{nk}$.
Again, for computational tractability, we need the lower bound $\mathcal{L}^{\nu}_k$ of the
term $\mathbb{E}_p[\log(1 - \prod_{j=1}^{k} \nu_j)]$. Using this lower bound, we can get an
upper bound of the KL-divergence term. Then, the inference procedure iteratively solves the
following steps:

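Infer p(ν), p(Z) and p(W): For p(W), we have the update rules

$$\begin{aligned}
\Phi_{\cdot k} &= \Big( \sum_n \frac{\psi_{nk}}{\sigma^2_{n0}} x_n - \sum_n \frac{\psi_{nk}}{\sigma^2_{n0}} \sum_{j \ne k} \psi_{nj} \Phi_{\cdot j} \Big) \Big( \frac{1}{\sigma_0^2} + \sum_n \frac{\psi_{nk}}{\sigma^2_{n0}} \Big)^{-1}, \\
\sigma_k^2 &= \Big( \frac{1}{\sigma_0^2} + \sum_n \frac{\psi_{nk}}{\sigma^2_{n0}} \Big)^{-1}.
\end{aligned} \tag{58}$$

For p(ν), we have the update rules similar to those in (Doshi-Velez et al., 2009), that is,

$$\begin{aligned}
\gamma_{k1} &= \alpha + \sum_{m=k}^{K} \sum_{n=1}^{N} \psi_{nm} + \sum_{m=k+1}^{K} \Big(N - \sum_{n=1}^{N} \psi_{nm}\Big) \Big(\sum_{i=k+1}^{m} q_{mi}\Big), \\
\gamma_{k2} &= 1 + \sum_{m=k}^{K} \Big(N - \sum_{n=1}^{N} \psi_{nm}\Big) q_{mk},
\end{aligned} \tag{59}$$

where $q_{\cdot k}$ is computed in the same way as in Eq. (50). For p(Z), the mean-field
update equation for ψ is

$$\psi_{nk} = \frac{1}{1 + e^{-\vartheta_{nk}}}, \tag{60}$$

where

$$\begin{aligned}
\vartheta_{nk} ={} & \sum_{j=1}^{k} \mathbb{E}_p[\log \nu_j] - \mathcal{L}^{\nu}_k(p) - \frac{1}{2\sigma^2_{n0}} \big(D \sigma_k^2 + \Phi_{\cdot k}^\top \Phi_{\cdot k}\big) \\
& + \frac{1}{\sigma^2_{n0}} \Phi_{\cdot k}^\top \Big( x_n - \sum_{j \ne k} \psi_{nj} \Phi_{\cdot j} \Big) + \sum_y \omega_n^y \mathbb{E}_p[\eta_{y_n}^k - \eta_y^k].
\end{aligned}$$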
Algorithm 3 Inference Algorithm of iLSVM

 1: Input: data $\mathcal{D} = \{(x_n, y_n)\}_{n \in \mathcal{I}_{\mathrm{tr}}} \cup \{x_n\}_{n \in \mathcal{I}_{\mathrm{tst}}}$, constants α and C
 2: Output: distributions p(ν), p(Z), p(W), p(η) and hyper-parameters $\sigma_0^2$ and $\sigma^2_{n0}$
 3: Initialize $\gamma_{k1} = \alpha$, $\gamma_{k2} = 1$, $\psi_{nk} = 0.5 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.001)$, $\Phi_{\cdot k} = 0$,
    $\sigma_k^2 = \sigma_0^2 = 1$, $\mu = 0$; $\sigma^2_{n0}$ is computed from $\mathcal{D}$.
 4: repeat
 5:   repeat
 6:     update $(\gamma_{k1}, \gamma_{k2})$ using Eq. (59), ∀1 ≤ k ≤ K;
 7:     update $\Phi_{\cdot k}$ and $\sigma_k^2$ using Eq. (58), ∀1 ≤ k ≤ K;
 8:     update $\psi_{nk}$ using Eq. (60), ∀n ∈ I_tr, ∀1 ≤ k ≤ K;
 9:     update $\psi_{nk}$ using Eq. (60), but without the last term, ∀n ∈ I_tst, ∀1 ≤ k ≤ K;
10:   until the relative change of L is less than a threshold τ or the iteration number is T (e.g., 10)
11:   solve the dual problem (61) (or its primal form) using a multi-class SVM learner.
12:   update the hyper-parameters $\sigma_0^2$ using Eq. (62) and $\sigma^2_{n0}$ using Eq. (63). (Optional)
13: until the relative change of L is less than a threshold τ' or the iteration number is T' (e.g., 20)



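For testing data, $\vartheta_{nk}$ does not have the last term because of the absence of
large-margin constraints.

Infer p(η) and solve for ω: By the convex duality theory, we have

$$p(\eta) \propto \pi(\eta) \exp\Big\{ \eta^\top \Big( \sum_{n \in \mathcal{I}_{\mathrm{tr}}} \sum_y \omega_n^y \mathbb{E}_p[g(y_n, x_n, z_n) - g(y, x_n, z_n)] \Big) \Big\}.$$

For the standard normal prior π(η), we have that p(η) is also normal, with mean

$$\mu = \sum_{n \in \mathcal{I}_{\mathrm{tr}}} \sum_y \omega_n^y \mathbb{E}_p[g(y_n, x_n, z_n) - g(y, x_n, z_n)]$$

and identity covariance matrix. The dual problem is

$$\max_{\omega} \; -\frac{1}{2} \mu^\top \mu + \sum_{n \in \mathcal{I}_{\mathrm{tr}}} \sum_y \omega_n^y \ell_n(y) \quad \text{s.t.:} \; 0 \le \sum_y \omega_n^y \le C, \; \forall n \in \mathcal{I}_{\mathrm{tr}}, \tag{61}$$

which (and its primal form) can be efficiently solved with a multi-class SVM solver.

Similar to MT-iLSVM, the hyperparameters $\sigma_0^2$ and $\sigma^2_{n0}$ can be set a
priori or estimated from the data. The empirical estimation can be easily done with closed
form solutions. For iLSVM, we have

$$\sigma_0^2 = \frac{\sum_{k=1}^{K} \big(D \sigma_k^2 + \Phi_{\cdot k}^\top \Phi_{\cdot k}\big)}{K D}, \tag{62}$$

$$\sigma^2_{n0} = \frac{x_n^\top x_n - 2 x_n^\top \Phi \mathbb{E}_p[z_n]^\top + \mathbb{E}_p[z_n A z_n^\top]}{D}. \tag{63}$$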
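The closed-form estimate of the prior variance of W can be sketched directly from the variational parameters. The sketch assumes the estimate has the form $\sum_k (D \sigma_k^2 + \Phi_{\cdot k}^\top \Phi_{\cdot k}) / (KD)$, with `Phi` a D×K list of rows holding the posterior means and `sigma_sq[k]` the posterior variance of column k; the function name is hypothetical.

```python
def estimate_sigma0_sq(Phi, sigma_sq):
    """Average expected squared entry of W under the Gaussian posterior.

    Each column k contributes D * sigma_k^2 (variance part) plus the squared
    norm of its mean Phi_.k; the total is divided by the K * D entries of W.
    """
    D, K = len(Phi), len(Phi[0])
    total = 0.0
    for k in range(K):
        col_norm_sq = sum(Phi[d][k] ** 2 for d in range(D))
        total += D * sigma_sq[k] + col_norm_sq
    return total / (K * D)
```

This is the value of the prior variance that minimizes the KL-divergence between the Gaussian posterior and a zero-mean isotropic Gaussian prior, which is why the update has a closed form.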